Using AWS Glue ETL Job with Streaming Data

7. Create Database and Table

For the ETL side of the configuration, start by creating a database. The database holds the catalog table that represents the data in the Kinesis data stream. You then write a Glue ETL job that reads data from the catalog table, transforms it, and writes it to the S3 bucket.

  1. Go to the AWS Lake Formation console. Click the Databases menu on the left, then click the Create database button.

  2. On the next screen, type dojodatabase as the database name and select the option Use only IAM access control for new tables in this database. Then click the Create database button. This exercise uses IAM role-based access to the database and catalog table, although Lake Formation-based security is also possible for this scenario.

  3. The database is created almost immediately. The next task is to create the table. Go to the AWS Glue console. Click the Tables menu on the left, then click the Add table manually option under the Add tables menu.

  4. On the next screen, type dojotable as the table name, keep dojodatabase selected as the database, and then click the Next button.

  5. On the next screen, select Kinesis as the source type. Type dojostream as the stream name and https://kinesis.eu-west-3.amazonaws.com as the Kinesis source URL. The workshop uses eu-west-3 as the region code because it was run in the Paris region. If you are using a different region, substitute its region code for eu-west-3. Click the Next button.

  6. On the next screen, select JSON as the classification, because the Kinesis data stream will carry JSON messages from the IoT device. Click the Next button.

  7. The device sends JSON messages to the Kinesis data stream in the format shown below. Configure the schema of the catalog table to match it: on the Define a schema page, click the Add column button.

Message format:

{
    "sensor": "",
    "temperature": 0,
    "vibration": 0
}
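
To make the link between the message format and the table schema concrete, here is a minimal Python sketch that checks whether a message carries exactly the three fields the catalog table expects (sensor as string, temperature and vibration as int). The helper name matches_schema and the sample values are hypothetical and used only for illustration; they are not part of the workshop.

```python
import json

# Column name -> Python type corresponding to the Glue column type.
# sensor is a string column; temperature and vibration are int columns.
EXPECTED_COLUMNS = {"sensor": str, "temperature": int, "vibration": int}


def matches_schema(raw_message: str) -> bool:
    """Return True if the JSON message has exactly the expected fields and types."""
    try:
        record = json.loads(raw_message)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict) or set(record) != set(EXPECTED_COLUMNS):
        return False
    # bool is a subclass of int in Python, so reject it explicitly.
    return all(
        isinstance(record[name], expected) and not isinstance(record[name], bool)
        for name, expected in EXPECTED_COLUMNS.items()
    )


# Hypothetical sample values, for illustration only.
sample = json.dumps({"sensor": "sensor-1", "temperature": 21, "vibration": 3})
print(matches_schema(sample))
```

A check like this can be handy when testing the device payload before it reaches the stream, since a message whose fields do not match the table columns may not be read as intended by the ETL job.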

  8. In the Add column popup, type sensor as the column name, select string as the column type, type 1 as the column number, and then click the Add button.

  9. Similarly, add two more columns, temperature and vibration, both of type int, with column numbers 2 and 3 respectively. Then click the Next button.

  10. On the next screen, click the Finish button. The table is created almost immediately. The next task is to create the Glue ETL job, which reads streaming data from Kinesis, transforms it, and writes it to the S3 bucket.
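
The console steps above could also be scripted. The following is a rough sketch using boto3, not the workshop's own method. It assumes that the option Use only IAM access control corresponds to granting ALL to IAM_ALLOWED_PRINCIPALS as the default table permission, and that a Kinesis-backed table is described by the storage parameters typeOfData, streamName and endpointUrl. Treat those keys as assumptions and compare them against a table created through the console before relying on them. The function names are hypothetical, and the endpoint pattern applies to standard AWS regions only.

```python
def kinesis_endpoint(region: str) -> str:
    """Build the Kinesis source URL for a standard AWS region."""
    return f"https://kinesis.{region}.amazonaws.com"


def build_database_input(name: str) -> dict:
    """Database definition that keeps IAM-only access control for new tables (assumed mapping)."""
    return {
        "Name": name,
        "CreateTableDefaultPermissions": [
            {
                "Principal": {"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"},
                "Permissions": ["ALL"],
            }
        ],
    }


def build_table_input(table: str, stream: str, region: str) -> dict:
    """Table definition for a JSON Kinesis stream with the three workshop columns."""
    return {
        "Name": table,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "json"},
        "StorageDescriptor": {
            # Column order follows column numbers 1, 2 and 3 from the steps above.
            "Columns": [
                {"Name": "sensor", "Type": "string"},
                {"Name": "temperature", "Type": "int"},
                {"Name": "vibration", "Type": "int"},
            ],
            "Location": stream,
            # Assumed parameter keys for a Kinesis source; verify against your account.
            "Parameters": {
                "typeOfData": "kinesis",
                "streamName": stream,
                "endpointUrl": kinesis_endpoint(region),
            },
            "SerdeInfo": {"SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"},
        },
    }


def create_catalog(region: str = "eu-west-3") -> None:
    """Create dojodatabase and dojotable; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builders above work without boto3 installed

    glue = boto3.client("glue", region_name=region)
    glue.create_database(DatabaseInput=build_database_input("dojodatabase"))
    glue.create_table(
        DatabaseName="dojodatabase",
        TableInput=build_table_input("dojotable", "dojostream", region),
    )


# create_catalog()  # uncomment to run against your own account
```

Keeping the request bodies in plain builder functions makes it possible to inspect or adjust them, for example by changing the region, before anything is sent to AWS.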