In this step, you create an S3 bucket and folders that serve as the source and destination data locations when you process data with the EMR cluster.
-
Log in to the AWS Management Console and change the region to Ireland (eu-west-1).
-
Go to the S3 Management Console. Create a bucket named dojo-data. Bucket names are globally unique, so this name may not be available; if it is taken, choose a name that is available and use it in place of dojo-data in the later steps. In this bucket, create three folders: input, output and script.
-
The input folder is the input location for the data to be processed by the EMR cluster. The output folder is the destination where the EMR cluster writes the processed data. The script folder stores the script used when creating a task within the EMR cluster.
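
The bucket and folders described above can also be created from code instead of the console. The following is a minimal sketch, assuming the boto3 library is installed and AWS credentials are already configured; the function names `create_bucket_with_folders` and `make_client` are hypothetical helpers introduced for illustration and are not part of this workshop. Since S3 has no real folders, the sketch creates empty objects whose keys end in "/", which the console then displays as folders.

```python
REGION = "eu-west-1"  # Ireland
BUCKET = "dojo-data"  # assumption: replace with your own name if this one is taken
FOLDERS = ("input/", "output/", "script/")


def make_client(region=REGION):
    """Build an S3 client (hypothetical helper; needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch loads without boto3

    return boto3.client("s3", region_name=region)


def create_bucket_with_folders(s3_client, bucket=BUCKET, region=REGION, folders=FOLDERS):
    """Create the bucket in the given region, then one empty object per folder."""
    s3_client.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": region},
    )
    for folder in folders:
        # A zero-byte object whose key ends in "/" acts as a folder marker.
        s3_client.put_object(Bucket=bucket, Key=folder)
    return [f"s3://{bucket}/{folder}" for folder in folders]
```

To run it against your account, call `create_bucket_with_folders(make_client())`. The client is passed in as a parameter so that the function can be exercised with a stand-in object before touching a real account.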
-
Download the customers.csv file from the link. Upload it to the input folder.
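
The upload can likewise be scripted. This is a sketch only, assuming boto3 is installed, credentials are configured, and customers.csv has already been downloaded to the local machine; `upload_to_input` is a hypothetical helper name, and the default bucket name should be changed if you used a different one.

```python
import os


def upload_to_input(s3_client, local_path, bucket="dojo-data", prefix="input/"):
    """Upload a local file into the bucket's input folder and return its S3 URI."""
    # Keep only the file name so the object lands directly under the prefix.
    key = prefix + os.path.basename(local_path)
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With a real client, the call would look like `upload_to_input(boto3.client("s3", region_name="eu-west-1"), "customers.csv")`, which places the file at input/customers.csv in the bucket.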
-
Open the customers.csv file to get familiar with the data you will work with. It is sample data. The S3 configuration is now ready. In the next step, you launch the EMR cluster.