Building an End-to-End Data Pipeline — Part 2 with AWS Services
12 min read · Apr 16, 2023
Introduction:
In Part 1 of this series, we built an end-to-end data pipeline that retrieves data from a JSON API, transforms it, and stores the JSON in Amazon S3 using Airflow and Python. In this article, we’ll explore how to use various AWS services, including S3, Lambda, Glue, Athena, and Redshift, to further process and analyze that data.
In Part 2, we will cover the following tasks:
- Write a Lambda function to reformat the JSON files
- Create a Glue Crawler and Glue tables for the JSON files in S3
- Query the data with Athena
- Transform the JSON files into Parquet format with a Glue Job
- Create a Glue Crawler and Glue tables for the Parquet files in S3
- Query the data again with Athena before writing it to Redshift
- Use Glue Studio to load data from the S3 bucket into Redshift
- Load the Parquet data into Redshift
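As a preview of the first task, here is a minimal sketch of what the reformatting Lambda might look like. It assumes the reformatting step converts a JSON array into newline-delimited JSON (the one-record-per-line layout that Glue and Athena expect); the handler, triggered by an S3 put event, rewrites the object in place. Bucket and key names are read from the event payload, and the exact transformation in the article may differ.

```python
import json

def to_ndjson(raw: str) -> str:
    """Convert a JSON array (or a single object) into newline-delimited
    JSON, with one serialized record per line."""
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    return "\n".join(json.dumps(r) for r in records)

def lambda_handler(event, context):
    """Hypothetical handler: for each S3 object in the trigger event,
    download it, reformat it to NDJSON, and write it back in place."""
    import boto3  # provided by the Lambda Python runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        s3.put_object(Bucket=bucket, Key=key, Body=to_ndjson(body))
```

Keeping the pure `to_ndjson` function separate from the S3 plumbing makes the transformation easy to unit-test without AWS credentials.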
Here is the architecture for this end-to-end data pipeline:
- S3 Bucket (JSON data): Store the JSON data from the API.
- AWS Lambda Function: Triggered…