Mastering Spark Query Plans: Narrow Transformations
In this article, we will explore narrow transformations in Apache Spark. We will walk through the query plan Spark produces and explain each component, step by step, using an example transformation that filters rows, splits names, and updates column values.
What are Narrow Transformations?
Narrow transformations in Spark are operations where each partition of data is processed independently, with no need to shuffle data between nodes in the cluster. Common examples include filtering rows, adding or modifying columns, and selecting specific columns. The key point is that each of these operations can be computed entirely within a single partition, so Spark never has to move rows across the network. You can verify this directly from the physical plan, as the short sketch below shows.
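As a quick illustration, here is a minimal sketch (not from the article; it assumes an active SparkSession bound to the usual spark variable) that filters a toy DataFrame and prints its physical plan. Because filter is narrow, the plan contains no Exchange (shuffle) operator:

df = spark.range(8)               # toy DataFrame with a single `id` column (assumes `spark` exists)
narrow = df.filter("id % 2 = 0")  # narrow: each partition is filtered independently
narrow.explain()                  # physical plan shows Filter over Range, with no Exchange node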
Let’s look at an example and break down how Spark handles these transformations under the hood using its query plan.
The Example Code
Here’s a code snippet where we perform some common narrow transformations on a DataFrame of customers:
from pyspark.sql import functions as F

df_narrow_transform = (
    df_customers                                                 # assumed to exist already
    .filter(F.col("city") == "boston")                           # keep only Boston customers
    .withColumn("first_name", F.split("name", " ").getItem(0))   # first token of the name
    .withColumn("last_name", F.split("name", " ").getItem(1))    # second token of the name
    .withColumn("age", F.col("age") + F.lit(5))                  # add 5 to every age
    .select("cust_id", "first_name", "last_name", "age")
)