Using AWS Glue Workflow


6. Create Glue Job

You create a Glue Job which reads the customers.csv data using the Glue Catalog and writes it to the target folder in the dojo-data S3 bucket. If you created the bucket with a different name, the job will write to that bucket instead.

  1. Go to the Glue Management Console. Click on the Jobs menu on the left and then click on the Add job button.


  2. On the next screen, type in dojojob as the job name. Select dojo-glue-role for the IAM Role. Select Spark for the type. Select the A new script to be authored by you option. Enter the s3://dojo-data/scripts location for both the S3 path where the script is stored and the Temporary directory fields. If you created the bucket with a different name, replace the dojo-data part with that name. Finally, click on the Next button.

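  If you prefer scripting over console clicks, the same job definition can be created with the AWS SDK. Below is a minimal sketch using boto3; the script file name dojojob.py is an assumption (the console generates the actual script location under the S3 path you selected).

    import boto3

    glue = boto3.client("glue")

    # Hypothetical scripted equivalent of the console step above.
    # The script file name (dojojob.py) is an assumption; the console
    # generates this location under the S3 path you chose.
    glue.create_job(
        Name="dojojob",
        Role="dojo-glue-role",  # IAM role created earlier in the exercise
        Command={
            "Name": "glueetl",  # "glueetl" is the Spark job type
            "ScriptLocation": "s3://dojo-data/scripts/dojojob.py",
            "PythonVersion": "3",
        },
        DefaultArguments={
            "--TempDir": "s3://dojo-data/scripts",  # Temporary directory field
        },
    )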

  3. On the next screen, click on the Save job and edit script button.

  4. It will save the job and open the job editor. Copy-paste the following code for the job.


    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Create a GlueContext from the Spark context.
    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the customers table (created by the crawler) from the
    # dojodb database in the Glue Data Catalog.
    customerdf = glueContext.create_dynamic_frame.from_catalog(
                 database="dojodb",
                 table_name="customers")

    # Write the data as JSON to the target folder in the S3 bucket.
    glueContext.write_dynamic_frame.from_options(
        customerdf,
        connection_type="s3",
        connection_options={"path": "s3://dojo-data/target"},
        format="json")
    


  5. If you created the S3 bucket with a different name, replace dojo-data with that name. The code is straightforward: it creates a GlueContext from the SparkContext, builds the customerdf dynamic frame from the customers table in the dojodb database, and then writes the data back to the target folder in the S3 bucket in JSON format, converting it from CSV along the way.
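  If you want to sanity-check the dynamic frame before writing, or emit a different output format, small variations like the sketch below work. This is purely illustrative and not part of the tutorial job; the format parameter is what controls the output format.

    # Optional checks on the dynamic frame (illustrative only).
    customerdf.printSchema()   # columns inferred by the crawler
    print(customerdf.count())  # number of records read

    # Changing the format parameter changes the output format;
    # for example, writing Parquet instead of JSON:
    glueContext.write_dynamic_frame.from_options(
        customerdf,
        connection_type="s3",
        connection_options={"path": "s3://dojo-data/target"},
        format="parquet")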

  6. You might wonder how the job can reference the customers table when it has not been created in the dojodb database yet. That is not a problem: in the Glue Workflow, the crawler runs first and the job runs after it, so by the time the job runs, the customers table will already exist in the dojodb database. (A quick way to check this yourself is shown in the sketch that follows.)
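  If you want to confirm that the crawler has created the table before the job runs, a quick check against the Glue Data Catalog with boto3 looks like the sketch below (illustrative only, not required for the exercise).

    import boto3

    glue = boto3.client("glue")

    # Illustrative check: has the crawler created the customers table yet?
    try:
        glue.get_table(DatabaseName="dojodb", Name="customers")
        print("Table exists - the job can read it.")
    except glue.exceptions.EntityNotFoundException:
        print("Table not found - run the crawler first.")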

  7. Click on the Save button to save the job script and then close the job editor.


  8. The job and crawler are ready. It is time to orchestrate them in a Glue Workflow.