Mastering Spark Query Plans: Wide Transformations — Joins
In Spark, joins are considered wide transformations because they require shuffling of data across the cluster. When you join two DataFrames, Spark needs to ensure that rows with the same key (in our example, cust_id
) are brought together on the same partition. This requires data movement across nodes, making joins computationally expensive operations.
In this article, we’ll explore how Spark handles joins under the hood by walking through the query plan for a typical join operation. We’ll use a sample DataFrame join and the explain()
output to understand each step in the plan.
The Example Code
Let’s start with an example where we perform an inner join between two DataFrames: df_transactions
and df_customers
. To avoid broadcasting smaller tables, we explicitly disable broadcast joins by setting the Spark configuration.
Learn about Broadcast() Join:
# Disable broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# Perform inner join
df_joined = (…