Building an End-to-End Data Pipeline — Part 2 with AWS Services
12 min read · Apr 16, 2023
Introduction:
In Part 1 of this series, we built an end-to-end data pipeline that retrieves data from a JSON API, transforms it, and stores the JSON in Amazon S3 using Airflow and Python. In this article, we’ll explore how to use various AWS services, including S3, Lambda, Glue, Athena, and Redshift, to further process and analyze that data.
In Part 2, we will cover the following tasks:
- Write a Lambda function to reformat the JSON files
- Create a Glue Crawler and Glue tables for the JSON files in S3
- Query the data with Athena
- Transform the JSON files into Parquet format with a Glue Job
- Create a Glue Crawler and Glue tables for the Parquet files in S3
- Query the data again with Athena before writing it to Redshift
- Use Glue Studio to load data from the S3 bucket into Redshift
- Load the Parquet data into Redshift
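As a preview of the first task, here is a minimal sketch of what the reformatting Lambda might look like. It assumes the reformatting step converts a JSON array into newline-delimited JSON (the one-record-per-line layout that Glue and Athena expect); the handler, triggered by an S3 put event, rewrites the object in place. Bucket and key names are read from the event payload, and the exact transformation in the article may differ.

```python
import json

def to_ndjson(raw: str) -> str:
    """Convert a JSON array (or a single object) into newline-delimited
    JSON, with one serialized record per line."""
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    return "\n".join(json.dumps(r) for r in records)

def lambda_handler(event, context):
    """Hypothetical handler: for each S3 object in the trigger event,
    download it, reformat it to NDJSON, and write it back in place."""
    import boto3  # provided by the Lambda Python runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(Bucket=bucket, Key=key, Body=to_ndjson(body))
```

Keeping the pure `to_ndjson` function separate from the S3 plumbing makes the transformation easy to unit-test without AWS credentials.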
Here is the architecture for this end-to-end data pipeline:
- S3 Bucket (JSON data): Store the JSON data from the API.
- AWS Lambda Function: Triggered…