Mastering Spark Query Plans: Wide Transformations — Joins

Sai Parvathaneni
6 min read · Sep 8, 2024

In Spark, joins are considered wide transformations because they require shuffling of data across the cluster. When you join two DataFrames, Spark needs to ensure that rows with the same key (in our example, cust_id) are brought together on the same partition. This requires data movement across nodes, making joins computationally expensive operations.
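To build intuition for why the shuffle is needed, here is a small pure-Python sketch of hash partitioning. It is an illustration only, not Spark's actual partitioner (Spark uses a Murmur3-based hash internally); the `partition_for` helper and the sample rows are made up for this example. The key property is the same: every row carrying the same join key maps to the same partition id, so both sides of the join can meet there.

```python
from collections import defaultdict
import zlib

def partition_for(key: str, num_partitions: int = 4) -> int:
    # Deterministic stand-in for Spark's hash partitioner:
    # all rows sharing a join key land on the same partition.
    return zlib.crc32(key.encode()) % num_partitions

# Hypothetical sample rows: (cust_id, payload)
transactions = [("c1", 100), ("c2", 250), ("c1", 75)]
customers = [("c1", "Alice"), ("c2", "Bob")]

# "Shuffle": route every row from both sides to its target partition
partitions = defaultdict(list)
for key, value in transactions + customers:
    partitions[partition_for(key)].append((key, value))
```

After this routing step, each partition holds every transaction and customer row for its keys, so the join can proceed locally with no further data movement.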

In this article, we’ll explore how Spark handles joins under the hood by walking through the query plan for a typical join operation. We’ll use a sample DataFrame join and the explain() output to understand each step in the plan.

The Example Code

Let’s start with an example where we perform an inner join between two DataFrames: df_transactions and df_customers. To prevent Spark from broadcasting the smaller table, we explicitly disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold to -1.

# Disable broadcast joins so Spark must shuffle both sides
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Perform inner join on cust_id and inspect the physical plan
df_joined = df_transactions.join(df_customers, on="cust_id", how="inner")
df_joined.explain()
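With broadcasting disabled, an equi-join like this typically compiles to a sort-merge join: each side is hash-partitioned on the join key (the Exchange, i.e. the shuffle), sorted, and then merged. An abbreviated, representative explain() output is sketched below; the exact operators, expression ids, and partition count will vary with your Spark version and configuration.

```
== Physical Plan ==
SortMergeJoin [cust_id#0], [cust_id#10], Inner
:- Sort [cust_id#0 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cust_id#0, 200)
:     +- ... scan of df_transactions ...
+- Sort [cust_id#10 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(cust_id#10, 200)
      +- ... scan of df_customers ...
```

Reading bottom-up on each branch: scan, shuffle by cust_id, sort, then the merge join at the top. The two Exchange nodes are the wide-transformation boundaries this article is about.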
