In this workshop, you learn how to use PySpark to create an AWS Glue job that processes and transforms data. Before you start, let’s look at the data you will configure in the data lake.
You will configure the data lake with two data sets: sales and customers. Both data sets are stored in Amazon S3. You use the AWS Glue and AWS Lake Formation services to create the data lake.
The schemas of the data sets are as follows:
customers data set fields: {CUSTOMERID, CUSTOMERNAME, EMAIL, CITY, COUNTRY, TERRITORY, CONTACTFIRSTNAME, CONTACTLASTNAME}
sales data set fields: {ORDERNUMBER, QUANTITYORDERED, PRICEEACH, ORDERLINENUMBER, SALES, ORDERDATE, STATUS, QTR_ID, MONTH_ID, YEAR_ID, PRODUCTLINE, MSRP, PRODUCTCODE, DEALSIZE, CUSTOMERID}
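To make the relationship between the two data sets concrete, here is a minimal sketch in plain Python that finds the fields the two schemas have in common. The field names are copied from the lists above; the constant names and the `shared_fields` helper are hypothetical and exist only for this illustration. Whether the workshop joins the data sets on the shared field is an assumption, not something stated here.

```python
# Field names copied from the two schemas listed above.
# The constant names are illustrative, not part of the workshop.
CUSTOMERS_FIELDS = [
    "CUSTOMERID", "CUSTOMERNAME", "EMAIL", "CITY", "COUNTRY",
    "TERRITORY", "CONTACTFIRSTNAME", "CONTACTLASTNAME",
]

SALES_FIELDS = [
    "ORDERNUMBER", "QUANTITYORDERED", "PRICEEACH", "ORDERLINENUMBER",
    "SALES", "ORDERDATE", "STATUS", "QTR_ID", "MONTH_ID", "YEAR_ID",
    "PRODUCTLINE", "MSRP", "PRODUCTCODE", "DEALSIZE", "CUSTOMERID",
]


def shared_fields(left, right):
    """Return the fields present in both lists, in the order they appear in `left`."""
    right_set = set(right)
    return [name for name in left if name in right_set]


# Candidate key for relating a sales row to a customer row.
join_keys = shared_fields(CUSTOMERS_FIELDS, SALES_FIELDS)
print(join_keys)  # ['CUSTOMERID']
```

CUSTOMERID is the only field that appears in both schemas, so it is the natural candidate for relating a sales record to a customer. In PySpark, assuming two hypothetical DataFrames named `sales_df` and `customers_df`, such a join could be written as `sales_df.join(customers_df, on="CUSTOMERID")`.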
Let’s start building now.