Using Amazon EMR with AWS Glue Catalog


8. Coding in Notebook

In this step, you run PySpark code on the EMR cluster to process S3-based data using the AWS Glue Catalog.

  1. In the EMR Management console, on the dojoemrnotebook notebook page, click on the Open in Jupyter button.


  2. The notebook environment opens in a new browser tab or window. In the Jupyter environment, click the PySpark option under the New menu.


  3. The Jupyter notebook IDE opens. In the cell, copy-paste the following code and run it. The code imports the modules needed for PySpark.


    import sys
    from datetime import datetime
    
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    


  4. Copy-paste and run the following code to get the Spark session.


    spark = SparkSession\
        .builder\
        .appName("SparkETL")\
        .getOrCreate()
    


  5. Copy-paste and run the following code to read data from the customers Glue Catalog table and show the data loaded into the dataframe.


    spark.catalog.setCurrentDatabase("dojodb")
    df = spark.sql("select * from customers")
    df.show()
    

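If the query fails with a "table not found" error, it can help to confirm that the Glue database and table are visible to the session. The following is a minimal sketch that assumes the `spark` session from the previous step. The helper name `table_exists` is hypothetical and not part of the workshop code.

```python
def table_exists(spark, database: str, table: str) -> bool:
    """Return True if `table` is listed in `database` of the session catalog."""
    # spark.catalog.listTables returns entries that expose a .name attribute
    names = [t.name.lower() for t in spark.catalog.listTables(database)]
    return table.lower() in names

# In the notebook, where `spark` already exists:
# table_exists(spark, "dojodb", "customers")
```

The comparison is case-insensitive because catalog table names are usually stored in lowercase.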

  6. Copy-paste and run the following code to select the CUSTOMERNAME and EMAIL columns from the dataframe and show the result.


    df = df.select("CUSTOMERNAME","EMAIL")
    df.show()
    


  7. Finally, run the following code to write the transformed data to the output folder in the S3 bucket. The write also converts the data from CSV to JSON format. If you created the bucket with a different name, use that name here.


    df.write.format("json").mode("overwrite").save("s3://dojo-lake/output/")
    

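If your bucket has a different name, one option is to build the output path from a variable instead of editing the string by hand. This is only a sketch: `output_path` and `write_json` are hypothetical helper names, and `dojo-lake` is the bucket name used in this workshop.

```python
def output_path(bucket: str, prefix: str = "output") -> str:
    """Build the S3 URI of the output folder, e.g. s3://dojo-lake/output/."""
    return f"s3://{bucket.strip('/')}/{prefix.strip('/')}/"


def write_json(df, bucket: str, prefix: str = "output") -> str:
    """Write `df` as JSON under the bucket and return the path used."""
    path = output_path(bucket, prefix)
    # mode("overwrite") replaces any data that already exists under `path`
    df.write.format("json").mode("overwrite").save(path)
    return path

# In the notebook, where `df` already exists:
# write_json(df, "dojo-lake")
```

Because the write uses overwrite mode, rerunning the cell replaces the previous output rather than appending to it.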

  8. The code has written the output to the S3 bucket. You can verify it by navigating to the output location in the S3 bucket.

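Spark's JSON writer produces one or more part files, each holding one JSON object per line. As a sketch of checking a downloaded part file with plain Python: `parse_json_lines` is a hypothetical helper, and the sample record is made up for illustration. The key casing in your real output may differ.

```python
import json


def parse_json_lines(text: str) -> list:
    """Parse JSON Lines text (one JSON object per line) into a list of dicts."""
    # Skip blank lines so a trailing newline does not cause a parse error
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up sample in the shape produced by the column select step
sample = '{"CUSTOMERNAME": "Example Co", "EMAIL": "info@example.com"}\n'
records = parse_json_lines(sample)
print(len(records), "record(s) parsed")
```

From the notebook itself, reading the folder back with `spark.read.json("s3://dojo-lake/output/")` and calling `show()` is another way to confirm the write.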

  9. This finishes the workshop. Follow the next step to clean up the resources so that you don’t incur any cost after the workshop.