Data aggregation is also an important aspect of data transformation in the ETL process. In this task, you will learn to apply aggregation and statistical methods to the data.
- On the AWS Glue console, open the Jupyter notebook if it is not already open. In Jupyter, click the New dropdown menu and select the Sparkmagic (PySpark) option. It opens a notebook file in a new window. Rename the notebook to aggregate.
- Copy and paste the following PySpark snippet (in the black box) into the notebook cell and click Run. It creates the Glue context. It takes some time to start the SparkSession and get the Glue context; wait for the confirmation message saying SparkSession available as 'spark'. An optional check of the session is sketched after the snippet.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
# Create a GlueContext on top of the existing (or a new) SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
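Optionally, you can confirm the session is live. This small sketch assumes the standard GlueContext API, which exposes the underlying SparkSession through its spark_session attribute.
# Optional sanity check: print the Spark version of the live session.
# GlueContext exposes the underlying SparkSession as spark_session.
spark = glueContext.spark_session
print(spark.version)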
- Run the following PySpark code snippet, which loads data into a DynamicFrame from the sales table in the dojodatabase database. It also converts the DynamicFrame to a DataFrame and uses the show method to display the top 20 rows. An optional schema check follows the snippet.
# Load the sales table from the dojodatabase catalog into a DynamicFrame
salesDF = glueContext.create_dynamic_frame.from_catalog(
    database="dojodatabase",
    table_name="sales")
# Convert the DynamicFrame to a Spark DataFrame
salesDF = salesDF.toDF()
salesDF.show()
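Before aggregating, it can help to confirm the column names and types. This optional line uses the standard DataFrame API:
# Optional: list the column names and data types of the DataFrame
salesDF.printSchema()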
- You can view the relationship between two columns using the crosstab method, which computes a pairwise frequency table. The method is generally used with categorical columns or features. Run the following PySpark code snippet to view the relationship between the productline and dealsize columns; an alternative using pivot is sketched after it.
salesDF.crosstab('productline', 'dealsize').show()
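A similar view can be built with groupBy and pivot from the standard Spark API; note that, unlike crosstab, missing combinations appear as null rather than 0:
# Alternative sketch: spread dealsize values into columns and count
# each (productline, dealsize) combination
salesDF.groupBy('productline').pivot('dealsize').count().show()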
- You can use the describe method to see summary statistics (count, mean, stddev, min, max) of a particular column. The method is generally used with numeric columns or features. Run the following PySpark code snippet to check statistics of the sales column; a percentile variant follows it.
salesDF.describe('sales').show()
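If you also want quartiles, DataFrames in Spark 2.3 and later offer a summary method; a minimal sketch:
# Optional: summary() extends describe() with percentiles (Spark 2.3+)
salesDF.select('sales').summary(
    'count', 'mean', 'stddev', 'min', '25%', '50%', '75%', 'max').show()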
- You can also use the groupby method on a particular column along with an individual aggregation function. Run the following PySpark code snippet to see the row count per productline; a sorted variant is sketched after it.
salesDF.groupby('productline').count().show()
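To spot the largest groups quickly, you can chain an orderBy onto the result:
# Optional: sort the group counts in descending order
salesDF.groupby('productline').count().orderBy('count', ascending=False).show()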
- Run the following PySpark code snippets one by one to see the min, max, mean, and sum of quantityordered per productline; a combined single-pass version follows the snippets.
salesDF.select("productline","quantityordered").groupby('productline').min().show()
salesDF.select("productline","quantityordered").groupby('productline').max().show()
salesDF.select("productline","quantityordered").groupby('productline').mean().show()
salesDF.select("productline","quantityordered").groupby('productline').sum().show()
- You can also apply multiple aggregations at once using the groupby and agg methods. Run the following PySpark code snippet to view the average quantityordered, sum of sales, and maximum priceeach per productline; a variant with named output columns follows it.
salesDF.groupby('productline').agg({'quantityordered': 'mean', 'sales': 'sum', 'priceeach': 'max'}).show()
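The dictionary form names the output columns automatically (for example avg(quantityordered)). If you want friendlier names, here is a sketch using pyspark.sql.functions with alias; the alias names below are illustrative:
from pyspark.sql import functions as F

# Same aggregations, with explicit (illustrative) output column names
salesDF.groupby('productline').agg(
    F.mean('quantityordered').alias('avg_quantity'),
    F.sum('sales').alias('total_sales'),
    F.max('priceeach').alias('max_price')).show()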
- You can use the orderBy method to sort the DataFrame by a particular column in ascending or descending order. Run the following PySpark code snippets one by one to sort the DataFrame by the sales column, first ascending and then descending; a multi-column sort is sketched after them.
salesDF.orderBy(salesDF.sales).show()
salesDF.orderBy(salesDF.sales.desc()).show()
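orderBy also accepts multiple sort keys with a direction per column; a small sketch on the same DataFrame:
# Optional: sort by productline ascending, then by sales descending within each group
salesDF.orderBy(salesDF.productline.asc(), salesDF.sales.desc()).show()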
- You have worked with a single dataset in the workshop up to this point. In the next task, you learn about working with more than one dataset together. Click the Close and Halt option under the File menu to close the notebook. Confirm if it asks whether to Leave Page.