Access S3 Data in Amazon Redshift using Redshift Spectrum
Amazon Redshift is the cloud data warehouse service in AWS. It integrates seamlessly with other storage services such as Amazon S3, which enables a very cost-effective data warehouse solution: warm data can be kept in Amazon Redshift storage while cold data can be kept in S3. You can then query the S3 data from Redshift in the same way you query data stored in Redshift itself. Redshift Spectrum is the feature that enables this S3 data access from Amazon Redshift.
In this exercise, you learn how to use Amazon Redshift Spectrum to access S3-based data from Amazon Redshift.
The AWS resource consumption for the exercise does not fall under the AWS Free Tier.
Step1: Prerequisite
You need an AWS account with administrative access to complete the exercise. If you don't have an AWS account, kindly use the link to create a free trial account for AWS.
Step2: Create S3 Bucket
A sample data file (sampledata.csv) has been provided for the exercise. Download it from the link.
Log in to the AWS Console and choose Ireland as the region.
Go to the S3 Management Console. Create an S3 bucket named dojo-data. If the bucket name is not available, create the bucket with a name that is available. Upload the sampledata.csv file to the bucket.
The sampledata.csv file contains dummy data about countries and their business index, with one country per line. You will access this data from the Redshift cluster using Redshift Spectrum.
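The exact contents of the provided file will differ, but based on the table definition used later in the exercise (a country column and an integer business index column, comma-delimited, with no header row skipped), the layout is along these lines; the values below are purely illustrative:
```
Australia,120
India,95
Ireland,130
```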
The data is ready. The next step is to create an IAM Role for the Redshift cluster.
Step3: Create IAM Role
You create an IAM Role for the Redshift cluster, which is used to provide access to the data in the S3 bucket.
Go to the IAM Management Console, click on the Roles menu on the left, and then click on the Create role button.
On the next screen, select Redshift - Customizable as the service / use case and click on the Next: Permissions button.
On the next screen, select PowerUserAccess as the policy and click on the Next: Tags button.
On the next screen, click on the Next: Review button.
On the next screen, type in dojoredshiftrole as the role name and click on the Create role button.
The role is created in no time. Make a note of the Role ARN, as you will need it later in the exercise. The next step is to launch the Redshift cluster.
Step4: Launch Redshift Cluster
You launch a Redshift cluster, which is used to query the S3-based data.
Go to the Redshift Management Console and click on the Create cluster button.
On the next screen, type in dojoredshift for the cluster identifier and select the Free trial option.
On the same screen, in the Database configurations section, type in Password1! for the master user password field. Keep the rest of the fields at their defaults.
On the same screen, expand the Cluster permissions (optional) section. Select dojoredshiftrole as the IAM Role and click on the Add IAM Role button (make sure you click this button so that the role is actually added to the cluster). Finally, click on the Create cluster button at the bottom of the screen.
It will take a few minutes to create the Redshift cluster. Wait until the status of the cluster changes to Available.
A few details about the cluster that will be useful later:
cluster identifier: dojoredshift
username: awsuser
password: Password1!
database: dev
The cluster is ready. Let's use it to access the S3-based sample data.
Step5: Access S3 Data from the Redshift Database
In this step, you create a new schema in the Redshift cluster database and then create a table in the schema using the S3-based data. You then use this table to access data from the S3 bucket.
In the Redshift Management Console, click on the EDITOR menu on the left. It will open the Connect to database popup. Select the Create new connection option. Select dojoredshift as the cluster. Type in dev for the database, awsuser for the user, and Password1! for the password. Finally, click on the Connect to database button to connect to the cluster.
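Once connected, you can optionally run a quick query in the editor to confirm that you are on the expected database and user. This check is not part of the original steps, just a sanity test:
```
-- Confirm the editor is connected to the dev database as awsuser
select current_database(), current_user;
```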
In the query editor, run the following SQL statement to create an external schema with the name s3dataschema. Replace {Redshift-Role-ARN} with the ARN of the IAM Role you created earlier in the exercise.
```
create external schema s3dataschema
from data catalog
database 'dev'
iam_role '{Redshift-Role-ARN}'
create external database if not exists;
```
It will create the new schema s3dataschema in no time. You can check that the schema was created in the data objects list.
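If you prefer to verify from SQL instead of the data objects list, a quick check against the SVV_EXTERNAL_SCHEMAS system view (an optional extra, not part of the original steps) looks like this:
```
-- List the external schemas registered in the cluster; s3dataschema should appear
select schemaname, databasename
from svv_external_schemas
where schemaname = 's3dataschema';
```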
Next, run the following SQL statement to create a table countrybusinessindex in the schema s3dataschema, which sources its data from the S3 bucket s3://dojo-data. If you created the bucket with a different name, replace the dojo-data part with that name.
```
create external table s3dataschema.countrybusinessindex(
    country nvarchar(100),
    businessindex int
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://dojo-data';
```
The table is created in no time. You can see the table listed under s3dataschema in the data objects list.
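As with the schema, you can also confirm the external table from SQL using the SVV_EXTERNAL_TABLES system view (again, an optional extra, not part of the original steps):
```
-- The external table should be listed along with its S3 location
select schemaname, tablename, location
from svv_external_tables
where schemaname = 's3dataschema';
```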
Next, run the following SQL query to fetch data from s3dataschema.countrybusinessindex, which is in turn fetched from the S3 bucket.
```
select * from s3dataschema.countrybusinessindex;
```
The query shows the data fetched from the sample file in the S3 bucket.
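You are not limited to a plain SELECT *; you can run regular SQL (filters, aggregations, even joins with local Redshift tables) against the external table. For example, a query along these lines ranks the countries by their business index; the actual results depend on the contents of your sampledata.csv:
```
-- Top five countries by business index from the S3-based external table
select country, businessindex
from s3dataschema.countrybusinessindex
order by businessindex desc
limit 5;
```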
This was a quick exercise to understand how a Redshift cluster can be configured to fetch data from an S3 bucket using Redshift Spectrum. Follow the next step to clean up the resources so that you don't incur any cost after the exercise.
Step6: Clean up
Delete dojoredshift Redshift Cluster.
Delete dojo-data S3 Bucket.
Delete dojoredshiftrole IAM Role.
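If you would rather remove only the Spectrum objects while keeping the cluster, a sketch of the corresponding SQL (run in the query editor before deleting the cluster) is:
```
-- Remove the external table and schema created in this exercise
drop table s3dataschema.countrybusinessindex;
drop schema s3dataschema;
```
Also note that the create external database if not exists clause may have created an external database entry in the AWS Glue Data Catalog; if you no longer need it, you can delete it from the Glue console.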
Thanks and hope you enjoyed the exercise.