Using Amazon Redshift in an AWS-Based Data Lake


8. PySpark Code to Access Data

The Jupyter notebook is ready for development. Let’s write PySpark code to work with the Redshift data in the data lake.

  1. Go to the AWS Glue console, click the Notebooks option in the left menu, then select the notebook and click the Open notebook button.


  2. On the next pop-up screen, click the OK button. The Jupyter notebook opens in a new browser window or tab. In the notebook window, click the Sparkmagic (PySpark) option under the New dropdown menu.


  3. A notebook file opens in a new browser window or tab. Copy and paste the following PySpark snippet into the notebook cell and click Run. It creates the Glue context, which connects to the Spark session and also provides access to the data lake catalog tables.


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Create the Glue context on top of the shared Spark context
glueContext = GlueContext(SparkContext.getOrCreate())
  4. It takes some time to start the SparkSession and get the Glue context. Wait for the confirmation message saying SparkSession available as ‘spark’.

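The confirmation also means the SparkSession behind the Glue context is ready. If you want to mix plain Spark APIs with the Glue ones, the session can be reached from the Glue context; a minimal, optional sketch:

# The Glue context wraps a SparkSession; use it for plain Spark APIs if needed
spark = glueContext.spark_session
print(spark.version)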

  5. Next, run the following PySpark code snippet, which loads data into the DynamicFrame dojodf from the dev_public_dojotable catalog table in the dojodb database. If you created the S3 bucket with a different name, replace dojo-rs-bkt with that name in the temporary folder parameter. An optional preview snippet follows the code below.


# Load the Redshift table registered in the Data Catalog into a DynamicFrame
dojodf = glueContext.create_dynamic_frame.from_catalog(
             database="dojodb",
             table_name="dev_public_dojotable",
             redshift_tmp_dir="s3://dojo-rs-bkt/scripts/")
dojodf.printSchema()
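Optionally, to verify the load, you can preview a few records and count the rows straight from the notebook; a minimal sketch (the count triggers a read from Redshift, so it may take a moment):

# Peek at a few rows and count the records in the loaded DynamicFrame
dojodf.toDF().show(5)
print(dojodf.count())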
  6. Run the following PySpark code snippet, which selects the firstname and lastname fields from the dojodf DynamicFrame to create a new DynamicFrame dojodfmini. An alternative using the Spark DataFrame API is sketched after the snippet.


# Keep only the firstname and lastname fields in a new DynamicFrame
dojodfmini = SelectFields.apply(dojodf, ["firstname", "lastname"])
dojodfmini.printSchema()
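If you prefer the plain Spark DataFrame API, the same selection can be sketched by converting the DynamicFrame to a DataFrame and back; the dojodfmini_alt name below is only illustrative:

from awsglue.dynamicframe import DynamicFrame

# Select the columns with the DataFrame API, then convert back to a DynamicFrame
mini_df = dojodf.toDF().select("firstname", "lastname")
dojodfmini_alt = DynamicFrame.fromDF(mini_df, glueContext, "dojodfmini_alt")
dojodfmini_alt.printSchema()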
  7. Next, run the following PySpark code snippet to write the dojodfmini data to the S3 bucket in CSV format. If you created the S3 bucket with a different name, replace dojo-rs-bkt with that name in the destination location parameter. Once you run the code, you can check the data in the s3://dojo-rs-bkt/data folder, or read it back from the notebook as sketched after the snippet.


# Write the DynamicFrame to the S3 bucket as CSV files
glueContext.write_dynamic_frame.from_options(
    dojodfmini,
    connection_type="s3",
    connection_options={"path": "s3://dojo-rs-bkt/data"},
    format="csv")
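To confirm the files landed in S3 without leaving the notebook, you can read the folder back into a DynamicFrame; a minimal sketch, assuming the default bucket name and that the CSV files carry a header row:

# Read the CSV files back from S3 to verify the write
check_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://dojo-rs-bkt/data"]},
    format="csv",
    format_options={"withHeader": True})
check_df.toDF().show(5)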
  8. Next, run the following PySpark code snippet to write the dojodfmini data to the Redshift database as a table named dojotablemini. Once you run the code, you can check the dojotablemini table in the dev database in the Redshift cluster. If you created the S3 bucket with a different name, replace dojo-rs-bkt with that name in the temporary folder parameter. A note on rerunning the write follows the snippet.


# Write the DynamicFrame to Redshift through the Glue connection, staging via S3
glueContext.write_dynamic_frame.from_jdbc_conf(
    dojodfmini,
    catalog_connection="dojoconnection",
    connection_options={"dbtable": "dojotablemini", "database": "dev"},
    redshift_tmp_dir="s3://dojo-rs-bkt/scripts/")
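Rerunning this cell typically appends rows to the existing table. If you want a clean reload instead, the Glue Redshift connection options also accept a preactions SQL string that runs before the load; a minimal sketch, assuming the table already exists:

# Optional: empty the target table before loading (only valid once the table exists)
glueContext.write_dynamic_frame.from_jdbc_conf(
    dojodfmini,
    catalog_connection="dojoconnection",
    connection_options={
        "dbtable": "dojotablemini",
        "database": "dev",
        "preactions": "truncate table dojotablemini;"},
    redshift_tmp_dir="s3://dojo-rs-bkt/scripts/")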
  9. You can use the query editor in the Redshift cluster to check the dojotablemini table and its data.


  10. This finishes the workshop. Follow the next step to clean up the resources so that you don’t incur any cost after the workshop.