AWS Dojo - Workshop - Using Custom Transformation in AWS Glue Studio

A fundamental principle of building a data lake is that all data in the lake should be cataloged. Cataloging is automated using crawlers in AWS Glue. A crawler uses role-based authorization to create catalog entries in the data lake database. In the earlier task, you created an IAM role, dojocrawlerrole, which the crawler will use to create the data catalog in the database. You need to grant database permissions to this role. After configuring the permissions, you will create and run a crawler to catalog the data.

Open the AWS Lake Formation console and click on the Databases option on the left. You will see the dojodb database listed.
Select the dojodb database and click on the Grant menu option under the Action dropdown menu.
On the Grant permissions screen, choose the My account option and select dojocrawlerrole in the IAM users and roles field. For the Database permissions, select only Create table and Alter. Then click on the Grant button. This authorizes the crawler role to create and alter tables in the database.
With the permissions assigned, it is time to configure and run the crawler. Open the AWS Glue console. Click on the Crawlers menu on the left and then click on the Add crawler button.
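The same grant can also be scripted. The following is a minimal sketch using the Lake Formation grant_permissions call in boto3, not part of the original workshop. The function names and the account ID in the usage note are hypothetical, and the API expects the full ARN of the role rather than just its name.

```python
def build_grant_request(role_arn, database="dojodb"):
    """Build keyword arguments for Lake Formation's grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Database": {"Name": database}},
        # "Create table" and "Alter" in the console map to these API values.
        "Permissions": ["CREATE_TABLE", "ALTER"],
    }


def grant_crawler_permissions(role_arn, database="dojodb", client=None):
    """Send the grant; a boto3 client is created when none is passed in."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3

        client = boto3.client("lakeformation")
    return client.grant_permissions(**build_grant_request(role_arn, database))
```

For example, grant_crawler_permissions("arn:aws:iam::123456789012:role/dojocrawlerrole") would issue the grant, where 123456789012 is a placeholder account ID that you would replace with your own.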
On the next screen, enter dojocrawler as the Crawler name and click Next.
On the next screen, select Data stores as the Crawler source type and click Next.
On the next screen, select S3 as the data store and select the Specified path in my account option. Provide the include path for the S3 bucket as s3://dojo-customer-data/data/customers. If you created the bucket with a different name, replace the dojo-customer-data part with that name. Click Next.
On the next screen, select No because you are going to crawl only one data store, and click Next.
On the next screen, select the Choose an existing IAM role option and then select the dojocrawlerrole IAM role from the dropdown list. Click Next.
On the next screen, select Run on demand as the frequency and click Next. In production use, you would generally schedule crawler runs so that the catalog is updated automatically.
On the next screen, select dojodb as the database. Click Next.
On the next screen, verify all crawler information on the screen and click Finish to create the crawler.
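The crawler settings chosen in the wizard above can be expressed as a single Glue create_crawler request. This is a sketch under assumptions rather than the workshop's own method: the function names are hypothetical, and the optional schedule parameter is only an illustration of how a production schedule could be supplied. Leaving the schedule out corresponds to Run on demand.

```python
def build_crawler_request(
    name="dojocrawler",
    role="dojocrawlerrole",
    database="dojodb",
    s3_path="s3://dojo-customer-data/data/customers",
    schedule=None,
):
    """Build keyword arguments for Glue's create_crawler call."""
    request = {
        "Name": name,
        "Role": role,  # IAM role name or ARN
        "DatabaseName": database,
        # A single S3 data store, matching the include path from the wizard.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        # Glue cron syntax, e.g. "cron(15 12 * * ? *)" for 12:15 UTC daily.
        request["Schedule"] = schedule
    return request


def create_crawler(client=None, **settings):
    """Create the crawler; a boto3 client is created when none is passed in."""
    if client is None:
        import boto3  # imported lazily so the builder works without boto3

        client = boto3.client("glue")
    return client.create_crawler(**build_crawler_request(**settings))
```

If you created the bucket with a different name, pass your own path, for example create_crawler(s3_path="s3://my-bucket/data/customers"), where my-bucket is a placeholder.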
The crawler is created in no time. Select the crawler and click on the Run crawler button to run the crawler.
It might take a couple of minutes for the crawler to finish crawling the bucket. You will then see a success message indicating that the crawler created one table, customers, in the dojodb database.
Go back to the AWS Lake Formation console and click on the Tables menu on the left. You can see that the customers table has been created.
In the next step, you will write a Glue job using AWS Glue Studio.
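Running the crawler and waiting for it to finish can also be scripted. The sketch below assumes a Glue client such as boto3.client("glue") is passed in. The function name, polling interval, and timeout are hypothetical choices. It relies on the crawler state returning to READY once a run has finished, and on the state leaving READY promptly after start_crawler is called.

```python
import time


def run_crawler_and_wait(client, name="dojocrawler", poll_seconds=15, timeout_seconds=900):
    """Start the crawler, poll until it is READY again, and return the last crawl status."""
    client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = client.get_crawler(Name=name)["Crawler"]
        # The state moves through RUNNING and STOPPING before returning to READY.
        if crawler["State"] == "READY":
            # LastCrawl may be missing if the crawler has never completed a run.
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {name} did not finish within {timeout_seconds} seconds")
```

A returned status of SUCCEEDED corresponds to the success message shown in the console.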
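The result can also be checked programmatically by listing the tables in the database through the Glue get_tables call. This is a small sketch that assumes a Glue client such as boto3.client("glue") is passed in. The function name is hypothetical.

```python
def list_table_names(client, database="dojodb"):
    """Return the names of all tables in the given Glue Data Catalog database."""
    names = []
    kwargs = {"DatabaseName": database}
    while True:
        page = client.get_tables(**kwargs)
        names.extend(table["Name"] for table in page["TableList"])
        # Results may be paginated; follow NextToken until it is absent.
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token
```

After a successful crawl, the returned list should include customers.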