Shuffling
Some processors cause Spark to shuffle the data, redistributing it so that it is grouped differently across partitions. Shuffling usually changes the number of partitions. The following processors cause Spark to shuffle data:
- Aggregate processor
- Join processor
- Rank processor
- Repartition processor
Shuffling can be an expensive operation, so for optimal performance, design your pipelines to use as few processors that cause shuffling as possible.
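As an illustration, the following minimal PySpark sketch shows the effect of a shuffle outside of a pipeline: an aggregation, roughly what an Aggregate processor does, regroups the data by key and changes the partition count. It is only a sketch, not how the processors above are implemented.
```python
# Minimal PySpark sketch of a shuffle: an aggregation regroups data by key
# across partitions, which changes the number of partitions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)
print(df.rdd.getNumPartitions())    # 8 partitions before the shuffle

agg = df.groupBy((df.id % 100).alias("bucket")).count()
print(agg.rdd.getNumPartitions())   # shuffle partitions: 200 by default,
                                    # unless adaptive execution coalesces them
```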
When shuffling data, Spark creates the minimum number of partitions defined in the `spark.sql.shuffle.partitions` property. By default, the property is set to 200 partitions. As a result, Spark usually creates empty partitions when shuffling data. When the pipeline writes to a File destination, Spark uses one of these empty partitions to write an extra file that contains no data.
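The effect is easy to see in a standalone PySpark sketch: an aggregation over a handful of keys still produces the default number of shuffle partitions, and most of them hold no rows. Adaptive query execution is disabled here only so that the default stays visible; this is an illustration, not a recommended setting.
```python
# Sketch of why a shuffle over few keys leaves most shuffle partitions empty.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("empty-shuffle-partitions")
    .config("spark.sql.adaptive.enabled", "false")  # keep the default count visible
    .getOrCreate()
)

small = spark.createDataFrame([(i % 3, i) for i in range(30)], ["key", "value"])
agg = small.groupBy("key").count()

print(agg.rdd.getNumPartitions())   # 200, the spark.sql.shuffle.partitions default
non_empty = agg.rdd.mapPartitions(
    lambda it: [1] if next(it, None) is not None else []
).count()
print(non_empty)                    # only 3 partitions actually contain rows
```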
You can override the `spark.sql.shuffle.partitions` property for a specific pipeline by adding the property as an extra Spark configuration for the pipeline. Spark uses the defined value when it runs the pipeline.
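The following standalone PySpark sketch shows what the override does by setting the property when building the session; in a pipeline you would instead add the same property as an extra Spark configuration. The value 32 is only an example.
```python
# Sketch of overriding spark.sql.shuffle.partitions in a standalone job.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partitions-override")
    .config("spark.sql.shuffle.partitions", "32")   # overrides the default of 200
    .getOrCreate()
)

df = spark.range(0, 1_000_000, numPartitions=8)
agg = df.groupBy((df.id % 100).alias("bucket")).count()
print(agg.rdd.getNumPartitions())   # 32 (possibly fewer with adaptive execution)
```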