Important Note: You will create AWS resources during the workshop which will incur cost in your AWS account. It is recommended to clean-up the resources as soon as you finish the workshop to minimize the cost.

Getting Started with Amazon EMR

Amazon EMR is a big data platform for processing large scale data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR is easy to set up, operate, and scale for the big data requirement by automating time-consuming tasks like provisioning capacity and tuning clusters.

In this workshop, you launch an EMR cluster. You then use Jupyter Notebook to do PySpark based programming with EMR Cluster. Finally, you launch a data processing task using EMR Cluster.

The following diagram shows the scenario you are going to build. Start the workshop

Amazon EMR