AWS Data Wrangler Series - Part 1 - Working with Lambda
AWS Data Wrangler is an open source initiative from AWS Professional Services. It extends the power of Pandas by letting you work with AWS data-related services using Pandas DataFrames. You can use Python Pandas and AWS Data Wrangler to build ETL pipelines with major services - Athena, Glue, Redshift, Timestream, QuickSight, CloudWatch Logs, DynamoDB, EMR, PostgreSQL, MySQL, SQL Server and S3 (Parquet, CSV, JSON and Excel).
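To give a flavor of the API before diving in, here is a minimal sketch of the read-transform-write pattern this exercise builds on. The bucket name, prefixes and column name are hypothetical; it assumes awswrangler 2.x is installed and AWS credentials are configured.
import awswrangler as wr

# Read every CSV file under an S3 prefix into a single Pandas DataFrame.
df = wr.s3.read_csv("s3://my-example-bucket/raw/", dataset=True)

# Transform with regular Pandas operations (ACTIVE is a hypothetical column).
df = df[df["ACTIVE"]]

# Write the result back to S3 as Parquet.
wr.s3.to_parquet(df, "s3://my-example-bucket/curated/", dataset=True)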
In this exercise, you learn to use AWS Data Wrangler with an AWS Lambda function.
Step1: Prerequisites
You need an AWS account with administrative access to complete the exercise. If you don’t have an AWS account, kindly use the link to create a free trial account for AWS.
Step2: Create IAM Role
You start with the creation of the IAM role which the AWS Lambda function uses for authorization to call other AWS services.
-
Log in to the AWS Console. Select Paris as the region.
-
Go to the IAM Management Console, click on the Roles menu on the left and then click on the Create role button.
-
On the next screen, select Lambda as the service and click on the Next: Permissions button.
-
On the next screen, select PowerUserAccess as the policy and click on the Next: Tags button.
-
On the next screen, click on the Next: Review button.
-
On the next screen, type in dojolambdarole for the Role name and click on the Create role button.
-
The role is created in no time; a scripted equivalent of this step is sketched below. After that, the next step is to create the S3 bucket and upload a sample data file.
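If you ever need to script this step, the same role can be created with boto3. A minimal sketch, assuming default AWS credentials; the trust policy and managed policy mirror the console choices above:
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Lambda service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName="dojolambdarole",
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the same PowerUserAccess managed policy selected in the console.
iam.attach_role_policy(
    RoleName="dojolambdarole",
    PolicyArn="arn:aws:iam::aws:policy/PowerUserAccess"
)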
Step3: Create S3 Bucket
You create an Amazon S3 bucket and upload a sample file customers.csv to it. The Lambda function will use Pandas and Data Wrangler to read this file, transform it and then upload the result back to the S3 bucket.
-
Download the sample data file customers.csv from the link. The file contains customer records, including CUSTOMERNAME and EMAIL columns.
-
Go to the AWS S3 Management Console. Create an S3 bucket with the name dojo-data-bucket. If this bucket name is already taken, create a bucket with a name which is available. Create two folders - input and output - in this bucket.
-
Navigate to the input folder and upload the customers.csv file into it.
-
The S3 bucket and the data are ready; a scripted equivalent of this step is sketched below. Next, you configure the Lambda layer for Data Wrangler.
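For reference, the bucket and upload can also be scripted with boto3. A sketch, assuming customers.csv sits in your working directory and the bucket name is still available (Paris maps to the eu-west-3 region):
import boto3

s3 = boto3.client("s3", region_name="eu-west-3")

# Outside us-east-1, the bucket region must be stated explicitly.
s3.create_bucket(
    Bucket="dojo-data-bucket",
    CreateBucketConfiguration={"LocationConstraint": "eu-west-3"}
)

# S3 has no real folders; the input/ prefix is created by the object key itself.
s3.upload_file("customers.csv", "dojo-data-bucket", "input/customers.csv")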
Step4: Create Lambda Layer
You configure a Lambda layer for AWS Data Wrangler, which the Lambda function then uses to call the Pandas and Data Wrangler APIs.
-
Download the awswrangler-layer-2.3.0-py3.8.zip file from the link.
-
Go to the Lambda Management Console. Select the Layers menu on the left and then click on the Create layer button.
-
On the next screen, type in dojowrlayer for the name. Select the Upload a .zip file option and upload the awswrangler-layer-2.3.0-py3.8.zip file you downloaded in the previous step. Select Python 3.8 for the runtime. Finally, click on the Create button.
-
The Lambda layer is created in no time; a scripted equivalent is sketched below. In the next step, you create the Lambda function.
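As with the earlier steps, the layer can also be published with boto3. A sketch, assuming the zip file sits in your working directory; note that zips larger than the direct-upload limit have to be staged in S3 instead (passing Content={"S3Bucket": ..., "S3Key": ...}):
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-3")

# Publish the downloaded Data Wrangler zip as a new layer version.
with open("awswrangler-layer-2.3.0-py3.8.zip", "rb") as f:
    response = lambda_client.publish_layer_version(
        LayerName="dojowrlayer",
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.8"]
    )

# The returned ARN identifies the layer version to attach to functions.
print(response["LayerVersionArn"])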
Step5: Create Lambda Function
You create a Lambda function which uses the layer and runs code to read the S3 file, transform it and then write the result back to the S3 bucket.
-
Go to the Lambda Management Console and click on the Create function button.
-
On the next screen, select Author from scratch as the option. Type in dojolambda as the Function name. Select Python 3.8 as the Runtime. Under Permissions, select Use an existing role as the option and then select dojolambdarole (the role you created in the earlier steps). Finally, click on the Create function button.
-
The Lambda function is created. Click on the Layers link.
-
It will expand the layers section on the Lambda configuration screen. Click on the Add a layer button.
-
On the next screen, select Custom layers option. Select dojowrlayer as the custom layer. Select 1 as the version. Click on the Add button.
-
The layer is added to the Lambda function. Next, go to the Basic settings section and click on the Edit button.
-
On the next screen, change the memory to 256 MB and the timeout to 5 minutes. Then click on the Save button.
-
Finally, it is time to write the code for the Lambda function. Go to the Function code section for the Lambda function, replace the code with the code below and then click on the Deploy button. If you created the bucket with a different name, update the code to use the bucket you created.
import json
import awswrangler as wr
import pandas as pd

def lambda_handler(event, context):
    # Read all CSV files under the input prefix into a Pandas DataFrame.
    df = wr.s3.read_csv("s3://dojo-data-bucket/input/", dataset=True)
    # Keep only the CUSTOMERNAME and EMAIL columns.
    df = df[["CUSTOMERNAME", "EMAIL"]]
    # Write the transformed DataFrame back to S3 in JSON format.
    wr.s3.to_json(df, "s3://dojo-data-bucket/output/mydata.json")
    return {
        'statusCode': 200,
        'body': json.dumps('Successful')
    }
-
The function code is updated. In the code above, you first import the pandas and awswrangler libraries from the layer. You then use the Data Wrangler API to read data from the S3 bucket into a Pandas DataFrame. You then apply a small transformation to select only two columns from the DataFrame. Finally, you use the Data Wrangler API to write the transformed DataFrame to the S3 bucket in JSON format.
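As an aside, once deployed, the function can also be invoked from a script instead of the console Test button used below. A sketch using boto3, assuming the function name above and an empty event:
import json
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-3")

# Invoke the function synchronously with an empty test event.
response = lambda_client.invoke(
    FunctionName="dojolambda",
    InvocationType="RequestResponse",
    Payload=json.dumps({})
)

# The Payload is a stream containing the handler's return value.
print(json.loads(response["Payload"].read()))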
-
The Lambda function code and configuration are ready to run. Click on the Test button.
-
On the next screen, type in dojotest for the Event name. Leave the input at the default, since the function does not use the input event. Finally, click on the Create button.
-
The test is created. Keeping dojotest selected, click on the Test button again.
-
The function runs successfully and you can see the success message returned.
-
The Lambda function has written the output to the S3 bucket. You can check the output JSON data file in S3.
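To verify the result programmatically, you can read the output file back into a DataFrame with Data Wrangler. A sketch, assuming the bucket name used above:
import awswrangler as wr

# Read the JSON output back into a Pandas DataFrame.
df_out = wr.s3.read_json("s3://dojo-data-bucket/output/mydata.json")

# Expect only the CUSTOMERNAME and EMAIL columns.
print(df_out.head())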
-
This concludes the exercise. Please follow the next step to clean up the resources so that you don’t incur any costs after the exercise.
Step6: Clean up
Delete dojowrlayer layer in the AWS Lambda console.
Delete dojolambda function in the AWS Lambda console.
Delete the dojo-data-bucket bucket in the S3 Management Console.
Delete dojolambdarole IAM role from the IAM Management console.
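If you prefer to script the clean-up, a boto3 sketch follows. It assumes the resource names above and that the layer has only version 1; the bucket must be emptied and the role’s policies detached before deletion:
import boto3

lambda_client = boto3.client("lambda", region_name="eu-west-3")
iam = boto3.client("iam")
s3 = boto3.resource("s3")

# Delete the layer version and the function.
lambda_client.delete_layer_version(LayerName="dojowrlayer", VersionNumber=1)
lambda_client.delete_function(FunctionName="dojolambda")

# Empty the bucket, then delete it.
bucket = s3.Bucket("dojo-data-bucket")
bucket.objects.all().delete()
bucket.delete()

# Detach the managed policy, then delete the role.
iam.detach_role_policy(
    RoleName="dojolambdarole",
    PolicyArn="arn:aws:iam::aws:policy/PowerUserAccess"
)
iam.delete_role(RoleName="dojolambdarole")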
Thanks and hope you enjoyed the exercise.