Mastering Spark Query Plans: Wide Transformations — Repartition and Coalesce

Sai Parvathaneni
5 min read · Sep 6, 2024

In Spark, transformations fall into two categories: narrow and wide. In a narrow transformation, each output partition is computed from a single input partition, so no data has to leave the partition it started in. A wide transformation shuffles data across partitions, which often means moving it between nodes in the cluster. Wide transformations are essential when you need to redistribute or reorganize data for efficient parallel processing.
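To make the distinction concrete, here is a toy model in plain Python, with no Spark involved. Partitions are modeled as lists, and the helper names `narrow_map` and `wide_repartition` are made up for this illustration. It only sketches the dependency pattern and is not how Spark implements either operation.

```python
from typing import Callable, List

Partition = List[int]


def narrow_map(partitions: List[Partition], fn: Callable[[int], int]) -> List[Partition]:
    """Narrow: each output partition is built from exactly one input partition."""
    return [[fn(x) for x in part] for part in partitions]


def wide_repartition(partitions: List[Partition], num_partitions: int) -> List[Partition]:
    """Wide: any output partition may receive rows from any input partition."""
    out: List[Partition] = [[] for _ in range(num_partitions)]
    for part in partitions:
        for x in part:
            # Toy stand-in for a partitioner: route each row by value modulo target count.
            out[x % num_partitions].append(x)
    return out


data = [[1, 2, 3], [4, 5, 6]]
doubled = narrow_map(data, lambda x: x * 2)
shuffled = wide_repartition(data, 3)

print(doubled)   # [[2, 4, 6], [8, 10, 12]]
print(shuffled)  # [[3, 6], [1, 4], [2, 5]]
```

In the narrow case the partition boundaries are untouched. In the wide case every output partition mixes rows from both inputs, and that mixing is the data movement a shuffle has to pay for.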

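In this article, we’ll explore two transformations that change how data is partitioned: repartition() and coalesce(). We’ll walk through their query plans to understand how Spark processes these operations under the hood. One distinction to keep in mind: repartition() always triggers a shuffle, while coalesce() merges existing partitions and, by default, avoids a full shuffle.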

1. Repartition

The repartition() transformation is used to increase or decrease the number of partitions by shuffling data across the cluster. This operation is often used when you need to redistribute data for better parallelism…
