Getting Started with Amazon EMR

5. Coding in Notebook

You will now run PySpark code in the EMR cluster to process data stored in S3.

  1. In the EMR Management Console, on the dojocluster notebook page, click the Open in Jupyter button.

  2. It will open the notebook environment in a new browser tab or window. In the Jupyter environment, click the PySpark option under the New menu.

  3. It will open the Jupyter notebook IDE. In the cell, copy-paste the following code and run it. The code imports the libraries required for PySpark.

    import sys
    from datetime import datetime

    # SparkSession is the entry point for working with DataFrames;
    # pyspark.sql.functions provides the built-in column functions.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import *
    


  4. Copy-paste and run the following code to get the Spark session. A quick optional check follows the code.

    # Build the Spark session for this notebook. getOrCreate() returns
    # the already-running session if one exists, otherwise it starts a
    # new one named "SparkETL".
    spark = SparkSession\
        .builder\
        .appName("SparkETL")\
        .getOrCreate()
    

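    To confirm the session is live, you can optionally run a quick check (not part of the original task):

    # Optional sanity check: print the version of the Spark session.
    print(spark.version)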

  5. Copy-paste and run the following code to read the customers.csv data from the S3 bucket into the customerdf dataframe. If you created the bucket with a different name, use that name instead. The read options tell Spark to infer the schema and to treat the first row as the header. An optional explicit-schema alternative is sketched after the code.

    customerdf = spark.read\
        .option("inferSchema", "true")\
        .option("header", "true")\
        .csv("s3://dojo-data/input/customers.csv")
    

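    Schema inference costs an extra pass over the data. As an alternative, you could supply an explicit schema. The sketch below is illustrative only: customers.csv likely has more columns than the two known from this tutorial, and an explicit schema must list every column in file order.

    from pyspark.sql.types import StructType, StructField, StringType

    # Illustrative partial schema -- extend it to cover every column
    # that actually appears in customers.csv, in order.
    customer_schema = StructType([
        StructField("CUSTOMERNAME", StringType(), True),
        StructField("EMAIL", StringType(), True),
    ])

    customerdf = spark.read.option("header", "true")\
        .schema(customer_schema)\
        .csv("s3://dojo-data/input/customers.csv")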

  6. Copy-paste and run the following code to check the schema of the customerdf dataframe. A programmatic alternative follows the code.

    customerdf.printSchema()
    

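    printSchema() prints the schema as a tree. If you prefer to inspect it programmatically (optional), the dataframe also exposes the column names and types directly:

    # Programmatic alternatives to printSchema():
    print(customerdf.columns)  # list of column names
    print(customerdf.dtypes)   # list of (column name, type) tuples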

  7. You will now perform a small transformation that selects only the CUSTOMERNAME and EMAIL fields from the dataframe. Copy-paste and run the following code to apply the transformation and check the schema of the transformed dataframe. A hypothetical further transformation is sketched after the code.

    customerdf = customerdf.select("CUSTOMERNAME","EMAIL")
    customerdf.printSchema()
    

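    The functions imported earlier from pyspark.sql.functions support richer column transformations. As a hypothetical extension of this step (not required for the task), you could normalize the e-mail addresses and drop rows without one; cleaneddf below is an illustrative name:

    from pyspark.sql.functions import col, lower

    # Hypothetical extra transformation: lower-case the EMAIL column
    # and keep only rows where an e-mail address is present.
    cleaneddf = customerdf\
        .withColumn("EMAIL", lower(col("EMAIL")))\
        .filter(col("EMAIL").isNotNull())
    cleaneddf.printSchema()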

  8. Finally, you write the transformed dataframe back to the output folder of the S3 bucket. Copy-paste and run the following code to write the transformed data in Parquet format. If you created the bucket with a different name, use that name instead. An optional read-back check follows the code.

    customerdf.write.format("parquet").mode("overwrite").save("s3://dojo-data/output/")
    

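    To double-check the write from inside the notebook (optional; this assumes the same dojo-data bucket), you can read the Parquet output back into a dataframe:

    # Optional check: read the Parquet files back and preview a few rows.
    outputdf = spark.read.parquet("s3://dojo-data/output/")
    outputdf.show(5)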

  9. The code has written the output to the S3 bucket. You can verify it by navigating to the output location in the S3 console.
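
    If you prefer to verify from code rather than the console, here is a minimal sketch using boto3 (an assumption on top of this tutorial: boto3 must be available to the kernel, and the notebook role needs s3:ListBucket permission on the dojo-data bucket):

    import boto3

    # List the objects the job wrote under the output/ prefix.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket="dojo-data", Prefix="output/")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])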

  10. This example showed how you can run PySpark code in a Jupyter notebook to perform data transformation. The notebook is mostly used for development purposes. In the next step, you run the same code using an EMR Task, the method typically used in production.