In this step, you create an Amazon S3 bucket and upload data for the data lake. For this workshop, you upload the data manually; in production, data is typically loaded through data ingestion services or pipelines such as AWS Glue or Amazon Kinesis. Since the focus of this workshop is building the data lake, data ingestion is out of scope.
-
Go to the S3 Management Console and use the + Create bucket button to create a new bucket named dojo-datalake. If the bucket name is not available, use a different name that is available. Make sure the bucket is created in the same region you chose for this workshop.
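If you prefer the AWS CLI to the console, the bucket can also be created with a single command. This is a sketch, not part of the workshop steps; the region shown is an example, so substitute the region you are working in (and a different bucket name if dojo-datalake is taken):

```shell
# Create the workshop bucket in your chosen region (us-east-1 is an example).
aws s3 mb s3://dojo-datalake --region us-east-1
```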
-
Click on the dojo-datalake bucket to open it. Within the bucket, create two folders, data and script, using the + Create folder button. The data folder holds the data lake data, while the script folder is used by Amazon Athena, which will be explained later.
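The same folders can be created from the AWS CLI. S3 has no real directories; a console "folder" is simply a zero-byte object whose key ends in a slash. A sketch, assuming the dojo-datalake bucket name from the previous step:

```shell
# Create the data and script "folders" as empty objects with trailing-slash keys.
aws s3api put-object --bucket dojo-datalake --key data/
aws s3api put-object --bucket dojo-datalake --key script/
```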
-
Click on the data folder to open it. Within the data folder, create two folders, customers and sales, using the + Create folder button.
-
Open the customers and sales folders one by one and upload the customers.csv and sales.csv files into the customers and sales folders respectively, using the Upload button. The customers.csv and sales.csv files are available for download using the following links -
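If you downloaded the two CSV files to your local machine, the uploads can also be scripted with the AWS CLI. A sketch, assuming the files sit in the current directory and the bucket layout created above:

```shell
# Upload each CSV into its matching folder under data/.
aws s3 cp customers.csv s3://dojo-datalake/data/customers/customers.csv
aws s3 cp sales.csv s3://dojo-datalake/data/sales/sales.csv

# List the keys under data/ to confirm both files landed in the right place.
aws s3 ls s3://dojo-datalake/data/ --recursive
```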
-
The data is ready. Let’s start with the data lake configuration.