Building AWS Glue Job using PySpark - Part 2 (of 2)


7: Write / Load Data at the Destination

So far in the workshop, you have learned methods to query and manipulate data, which help in transforming the data as part of the ETL process. In this task, you learn to write data at the destination. As an example, you will ETL data from an S3-based data catalog source to another S3 location within the data lake. Similar methods are used to load data to other destinations such as relational databases and Redshift. The workshop also provides some information about loading data to a relational database or Redshift as the destination.

  1. On the AWS Glue console, open the Jupyter notebook if it is not already open. On the Jupyter notebook, click the New dropdown menu and select the Sparkmagic (PySpark) option. It will open a notebook file in a new window. Rename the notebook to loaddata.

  2. Copy and paste the following PySpark snippet (in the black box) into the notebook cell and click Run. It will create the Glue context. It takes some time to start the SparkSession and get the Glue context. Wait for the confirmation message saying SparkSession available as 'spark'.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create a GlueContext on top of the existing (or a new) SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
  3. Run the following PySpark code snippet, which loads data into a DynamicFrame from the sales table in the dojodatabase database. You can optionally inspect the loaded data using the short sketch after the snippet.
salesDF = glueContext.create_dynamic_frame.from_catalog(
             database="dojodatabase",
             table_name="sales")
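To confirm what was loaded, you can inspect the DynamicFrame before transforming it. This is a minimal, optional sketch; it assumes the salesDF variable from the step above and uses the standard DynamicFrame methods count() and toDF().

# Optional inspection of the loaded data.
# count() returns the number of records; toDF() converts to a Spark DataFrame.
print(salesDF.count())
salesDF.toDF().show(5)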
  4. Create a smaller DynamicFrame, productlineDF, which you want to write back to another S3 location within the data lake. The productlineDF will have only three columns: productline, dealsize and status. (An equivalent approach using the Spark DataFrame API is sketched after the snippet.)
productlineDF = SelectFields.apply(salesDF, paths = ["productline","dealsize","status"])
productlineDF.printSchema()
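If you prefer the Spark DataFrame API for this kind of column selection, the following sketch achieves the same result and wraps the output back into a DynamicFrame so the Glue write methods can consume it. The frame name productlineDF2 is just an illustrative label, not part of the workshop.

from awsglue.dynamicframe import DynamicFrame

# Select the same three columns with Spark DataFrame operations,
# then convert back to a DynamicFrame for the Glue write methods.
productlineSparkDF = salesDF.toDF().select("productline", "dealsize", "status")
productlineDF2 = DynamicFrame.fromDF(productlineSparkDF, glueContext, "productlineDF2")
productlineDF2.printSchema()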
  5. You now write the productlineDF DynamicFrame back to another location in S3. Run the following PySpark code snippet to write the DynamicFrame to the productline folder within the s3://dojo-data-lake/data S3 bucket. If you recall, it is the same bucket you configured as the data lake location and where your sales and customers data are already stored. If you created the bucket with a different name, use that one instead. The snippet uses json as the file format, which converts the data from CSV to JSON. You could retain the original format by choosing csv, or choose other formats such as Avro, Parquet and ORC (a Parquet variation is sketched after the snippet).
glueContext.write_dynamic_frame.from_options(
    productlineDF,
    connection_type = "s3",
    connection_options = {"path": "s3://dojo-data-lake/data/productline"},
    format = "json")
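If you would rather produce columnar output, a hedged variation of the same call writes Parquet files partitioned by the productline column. The productline-parquet prefix below is only an example path; adjust it to your own bucket and folder.

# Write the same DynamicFrame as Parquet, partitioned by productline.
# The S3 path is illustrative; point it at your own data lake bucket.
glueContext.write_dynamic_frame.from_options(
    productlineDF,
    connection_type = "s3",
    connection_options = {
        "path": "s3://dojo-data-lake/data/productline-parquet",
        "partitionKeys": ["productline"]
    },
    format = "parquet")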
  6. This was an example of ETL from S3 to S3. Similarly, you can do ETL with other data sources and destinations as well, such as relational databases and Redshift. For instance, the following code snippet loads productlineDF into a Redshift database connected using the dojoconnection Glue connection. (A JDBC variant for a plain relational database is sketched after the snippet.)
glueContext.write_dynamic_frame.from_jdbc_conf(productlineDF, catalog_connection = "dojoconnection",
                                                   connection_options = {"dbtable": "products", "database": "dojodatabase"},
                                                   redshift_tmp_dir = "s3://dojo-data-lake/data/script")
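For a plain relational database reached through a JDBC Glue connection (for example MySQL or PostgreSQL), the call looks similar but does not need the Redshift staging directory. This is a sketch only; the connection name dojojdbcconnection and the dbtable / database values are placeholders, not resources created in this workshop.

# Write to a relational database over JDBC using a Glue connection.
# The connection, database and table names below are placeholders.
glueContext.write_dynamic_frame.from_jdbc_conf(
    productlineDF,
    catalog_connection = "dojojdbcconnection",
    connection_options = {"dbtable": "products", "database": "salesdb"})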
  7. With this task, you have learned the fundamentals of ETL operations: Extract (query from the source), Transform (data manipulation) and Load (write at the destination). The workshop is complete. Kindly follow the next task to clean up the resources so that you don't incur any cost after the workshop.