Using AWS Glue ETL Job with Streaming Data


8. Create Glue ETL Job

In this task, you create an AWS Glue ETL job that reads data from the Kinesis data stream (through the Glue Catalog table), transforms it, and writes the result to the S3 bucket.
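To make the transformation concrete, here is a minimal plain-Python sketch of the JSON-to-CSV conversion the job performs on each batch of stream records. It is only an illustration: the real job runs the script that Glue generates on Spark, and the field names (`device_id`, `temperature`, `ts`) and the helper `json_records_to_csv` are hypothetical, not taken from this workshop.

```python
import csv
import io
import json


def json_records_to_csv(records, columns):
    """Convert JSON-encoded records into one CSV document with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for raw in records:
        row = json.loads(raw)
        # Keep only the mapped columns; a missing field becomes an empty cell.
        writer.writerow({name: row.get(name, "") for name in columns})
    return buffer.getvalue()


# Hypothetical batch of device messages (assumed shape, for illustration only).
batch = [
    '{"device_id": "d1", "temperature": 21.5, "ts": 1}',
    '{"device_id": "d2", "temperature": 19.0, "ts": 2, "extra": "ignored"}',
]
csv_text = json_records_to_csv(batch, ["device_id", "temperature", "ts"])
print(csv_text)
```

The column list plays the role of the source-to-target mapping that you confirm in the job wizard: fields outside the mapping are dropped from the output.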

  1. Go to the AWS Glue console. Click the Jobs menu on the left, then click the Add job button.

  2. On the next screen, type dojojob as the job name. Select dojogluerole as the IAM role. Select s3://dojo-data-stream/script for both the S3 path where the script is stored and the Temporary directory fields. If you created the bucket with a different name, use that one. Keep the rest of the fields at their defaults and click the Next button.

  3. On the next screen, select dojotable as the table and click on the Next button.

  4. On the next screen, select the Change schema option and click on the Next button.

  5. On the next screen, select Create tables in your data target as the target option. Select Amazon S3 as the data store and CSV as the format; the job transforms the JSON data from the Kinesis data stream into CSV data in the Amazon S3 bucket. Select None as the compression type and s3://dojo-data-stream as the target path. If you created the bucket with a different name in the previous task, use that bucket. Finally, click the Next button.

  6. On the next screen, keep the source to target column mapping as the default and click on the Save job and edit script button.

  7. The job is saved. Click on the X icon on the top right of the job page to close the job definition.

  8. You return to the job list screen. Select dojojob and choose the Run job option from the Action drop-down.

  9. A pop-up appears asking you to confirm the run. Click on the Run job button on the pop-up.

  10. The job execution will start. You can see the status of the job in the lower panel when you select the job.

  11. At first, the Execution time column of the job run status shows 0 secs. Wait until the execution time changes to a non-zero value (for example, 1 min).

  12. The Glue ETL job is now running and is waiting for data in the Kinesis data stream. It is time to publish some data into the Kinesis data stream from the IoT device.
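If you prefer to start the job from code rather than the console, the run-and-wait steps above can be sketched with the AWS SDK for Python. This is a minimal sketch, not part of the workshop: the helper name `start_and_wait_for_running` and its polling defaults are assumptions, and the `glue` argument is expected to behave like `boto3.client("glue")`, whose `start_job_run` and `get_job_run` calls are used here.

```python
import time

# States after which the run will never reach RUNNING.
TERMINAL_STATES = ("FAILED", "STOPPED", "TIMEOUT", "ERROR", "SUCCEEDED")


def start_and_wait_for_running(glue, job_name, poll_seconds=15, max_polls=40):
    """Start a Glue job run and poll until it reports RUNNING.

    `glue` is assumed to behave like boto3.client("glue").
    Returns (run_id, last_observed_state).
    """
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    state = "STARTING"
    for _ in range(max_polls):
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state == "RUNNING" or state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return run_id, state
```

With boto3 installed and AWS credentials configured, a call such as `start_and_wait_for_running(boto3.client("glue"), "dojojob")` would correspond to steps 8 to 11. Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real account.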