Shuffling

Some processors cause Spark to shuffle data, redistributing it so that it is grouped differently across partitions. When shuffling data, Spark usually increases the number of partitions.

The following processors can cause Spark to shuffle the data:
  • Aggregate processor
  • Join processor
  • Rank processor
  • Repartition processor

Shuffling can be an expensive operation, so for optimal performance, design your pipelines to minimize the number of processors that cause shuffling.

When shuffling data, Spark creates at least the number of partitions defined in the spark.sql.shuffle.partitions property, which defaults to 200 partitions. Because that default is usually larger than the number of partitions that actually contain data, shuffling typically produces empty partitions. When the pipeline writes to a File destination, an empty partition results in an extra output file that contains no data.
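
For illustration only, the following sketch shows the same behavior in a standalone Spark job rather than a pipeline. The object name, the local master setting, and the dataset are assumptions made for the example; adaptive execution is disabled so the raw shuffle partition count is visible.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShufflePartitionCount")
      .master("local[*]")
      // Disable adaptive coalescing so the raw shuffle partition count is visible.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    // A small dataset with only 10 distinct keys.
    val df = spark.range(0, 1000)
      .selectExpr("id % 10 AS key", "id AS value")

    // groupBy triggers a shuffle, so the result is spread across
    // spark.sql.shuffle.partitions partitions (200 by default).
    val counts = df.groupBy("key").count()

    // Prints 200 with the default setting; only 10 of those partitions hold data.
    println(s"Partitions after shuffle: ${counts.rdd.getNumPartitions}")

    spark.stop()
  }
}
```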

Tip: You can override the default value of the spark.sql.shuffle.partitions property for a specific pipeline by adding the property as an extra Spark configuration for the pipeline. Spark uses the defined value when it runs the pipeline.
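
As a rough sketch of what that configuration does, the following standalone Spark snippet sets the same property on the session. The value 20 is only an example; choose a value that suits your data volume.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionOverride {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ShufflePartitionOverride")
      .master("local[*]")
      // Same property a pipeline would pass as an extra Spark configuration.
      .config("spark.sql.shuffle.partitions", "20")
      // Keep the partition count predictable for this demonstration.
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()

    val counts = spark.range(0, 1000)
      .selectExpr("id % 10 AS key")
      .groupBy("key")
      .count()

    // The shuffle now produces 20 partitions instead of the default 200.
    println(s"Partitions after shuffle: ${counts.rdd.getNumPartitions}")

    spark.stop()
  }
}
```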