Using Custom Transformation in AWS Glue Studio

6: Configure and Run Crawler

A fundamental principle of building a data lake is that all data in it should be cataloged. In AWS Glue, cataloging is automated using crawlers. A crawler uses role-based authorization to create catalog entries in the data lake database. In an earlier task, you created the IAM role dojocrawlerrole, which the crawler will use to create the data catalog in the database. You first need to grant database permissions to this role. After configuring the permissions, you will create and run the crawler to catalog the data.
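As a scripted alternative to the console steps for the permission grant, the following is a minimal sketch using the Lake Formation `grant_permissions` call through a boto3-style client. It assumes AWS credentials and a region are already configured. The helper names and the account ID in the example ARN (123456789012) are placeholders for illustration, not values from this workshop.

```python
def build_grant_request(role_arn, database_name):
    """Build the arguments for the Lake Formation grant_permissions call.

    Mirrors the console choice of only "Create table" and "Alter"
    as database permissions for the crawler role.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database_name}},
        "Permissions": ["CREATE_TABLE", "ALTER"],
    }


def grant_crawler_permissions(lakeformation_client, role_arn, database_name):
    """Grant the crawler role permission to create and alter tables.

    lakeformation_client is expected to behave like
    boto3.client("lakeformation").
    """
    request = build_grant_request(role_arn, database_name)
    lakeformation_client.grant_permissions(**request)
    return request


# Placeholder ARN: replace 123456789012 with your own AWS account ID.
EXAMPLE_ROLE_ARN = "arn:aws:iam::123456789012:role/dojocrawlerrole"
```

With credentials in place, calling `grant_crawler_permissions(boto3.client("lakeformation"), EXAMPLE_ROLE_ARN, "dojodb")` after `import boto3` should have the same effect as the console grant. Building the request in a separate function keeps the permission set easy to inspect before anything is sent to AWS.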

  1. Open the AWS Lake Formation console and click on the Databases option on the left. You will see the dojodb database listed.

  2. Select the dojodb database and click on the Grant menu option under the Action dropdown menu.

  3. On the Grant permissions screen, choose the My account option and select dojocrawlerrole in the IAM users and roles field. For Database permissions, select only Create table and Alter. Then click on the Grant button. This authorizes the crawler role to create and alter tables in the database.

  4. With the permissions assigned, it is time to configure and run the crawler. Open the AWS Glue console. Click on the Crawlers menu on the left and then click on the Add crawler button.

    Crawler Menu

  5. On the next screen, enter dojocrawler as the Crawler name and click Next.

    Crawler Name

  6. On the next screen, select Data stores as the Crawler source type and click Next.

  7. On the next screen, select S3 as the data store and select the Specified path in my account option. Enter s3://dojo-customer-data/data/customers as the include path. If you created the bucket with a different name, replace the dojo-customer-data part with that name. Click Next.

    Crawler Data Path

  8. You are going to crawl only one data store, so on the next screen select No when asked whether to add another data store and click Next.

  9. On the next screen, select the Choose an existing IAM role option and then select the IAM role dojocrawlerrole from the dropdown list. Click Next.

    Crawler Role

  10. On the next screen, select Run on demand as the frequency and click Next. In production, you would generally schedule the crawler to run so that it updates the catalog automatically.

  11. On the next screen, select dojodb as the database. Click Next.

    Crawler Output

  12. On the next screen, verify all crawler information on the screen and click Finish to create the crawler.

  13. The crawler is created almost immediately. Select the crawler and click on the Run crawler button to run it.

    Crawler Run

  14. It might take a couple of minutes for the crawler to finish crawling the bucket. When it finishes, you will see a success message stating that the crawler created one table, customers, in the dojodb database.

    Crawler Tables

  15. Go back to the AWS Lake Formation console and click on the Tables menu on the left. You can see the customers table that the crawler created.

    Crawler Tables

  16. In the next task, you will create a Glue job using AWS Glue Studio.
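
The crawler steps above (create the crawler, run it on demand, then check the resulting table) can also be sketched in Python against a boto3-style Glue client. This is an illustrative sketch and not part of the original workshop: the helper names are hypothetical, AWS credentials and a region are assumed to be configured, and the polling loop assumes the crawler reports a state other than READY once it has been started.

```python
import time


def build_crawler_request(name, role, database, s3_path):
    """Build the arguments for the Glue create_crawler call.

    No "Schedule" key is set, which corresponds to the
    "Run on demand" frequency chosen in the console.
    """
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler_and_list_tables(glue_client, request, poll_seconds=15):
    """Create and start the crawler, wait for it, and list table names.

    glue_client is expected to behave like boto3.client("glue").
    Only the first page of get_tables results is read, which is
    enough for the single customers table in this exercise.
    """
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])

    # Poll until the crawler returns to the READY state.
    while glue_client.get_crawler(Name=request["Name"])["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    response = glue_client.get_tables(DatabaseName=request["DatabaseName"])
    return [table["Name"] for table in response["TableList"]]


# Values taken from the steps above; change the bucket name if yours differs.
CRAWLER_REQUEST = build_crawler_request(
    name="dojocrawler",
    role="dojocrawlerrole",
    database="dojodb",
    s3_path="s3://dojo-customer-data/data/customers",
)
```

With credentials in place, `run_crawler_and_list_tables(boto3.client("glue"), CRAWLER_REQUEST)` after `import boto3` should return a list that includes customers once the crawl has finished.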