Partitioning

When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.

Spark automatically handles the partitioning of pipeline data for you. However, at times you might need to change the size and number of partitions.
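Outside of Transformer, you can observe the same behavior directly in Spark. The following PySpark sketch is a minimal illustration, not code that Transformer generates; it shows Spark splitting a dataset into partitions and applying a transformation to the partitions in parallel:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-demo").getOrCreate()

    # Spark splits this dataset into partitions automatically.
    df = spark.range(0, 1_000_000)
    print(df.rdd.getNumPartitions())        # number of partitions Spark chose

    # The transformation runs on each partition in parallel across the executors.
    doubled = df.selectExpr("id * 2 AS doubled")
    doubled.show(5)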

Initial Partitions

When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions to process the output stream from the origin, unless a processor causes Spark to shuffle the data.

Spark determines how to initially split pipeline data into partitions based on the origin type. For example, for file-based origins, partitions are determined by the number of files being read and the file type. For the JDBC Table origin, the number of partitions to use is defined by a stage property.

For information about how an origin creates partitions, see "Partitioning" in the origin documentation.
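To make the difference concrete, the following PySpark sketch reads from a file-based source and from a JDBC table. The paths, connection details, and bounds are hypothetical, and the options shown are standard Spark read options rather than Transformer stage properties:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("initial-partitions-demo").getOrCreate()

    # File-based source: the partition count depends on the files being read.
    files_df = spark.read.parquet("/data/input/")
    print(files_df.rdd.getNumPartitions())

    # JDBC source: the partition count is configured explicitly, similar in
    # spirit to the partition-related properties on the JDBC Table origin.
    jdbc_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/sales")
        .option("dbtable", "orders")
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "1000000")
        .option("numPartitions", "10")
        .load())
    print(jdbc_df.rdd.getNumPartitions())   # 10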

Shuffling

Some processors cause Spark to shuffle the data, redistributing the data so that it's grouped differently across partitions. When shuffling data, Spark usually increases the number of partitions.

The following processors can cause Spark to shuffle the data:
  • Aggregate processor
  • Join processor
  • Rank processor
  • Repartition processor

Shuffling can be an expensive operation, so for optimal performance, design your pipelines with as few shuffle-causing processors as possible.
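In Spark terms, these processors perform wide transformations such as grouped aggregations and joins. The following PySpark sketch, with illustrative column names only, shows the partition count changing when a grouped aggregation forces a shuffle:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

    orders = spark.range(0, 1_000_000).withColumn("store", F.col("id") % 50)
    print(orders.rdd.getNumPartitions())    # initial partitions

    # A grouped aggregation (comparable to the Aggregate processor) forces a
    # shuffle: all rows with the same key must end up in the same partition.
    totals = orders.groupBy("store").agg(F.count("*").alias("order_count"))
    print(totals.rdd.getNumPartitions())    # typically 200, the spark.sql.shuffle.partitions default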

When shuffling data, Spark creates at least the number of partitions defined in the spark.sql.shuffle.partitions property. By default, the property is set to 200 partitions. As a result, when a batch contains less data than the configured number of partitions, Spark usually creates empty partitions when shuffling the data. When the pipeline writes to a File destination, Spark uses one of these empty partitions to write an extra file that contains no data.

Tip: You can change the default value of the spark.sql.shuffle.partitions property for a specific pipeline by adding the property as an extra Spark configuration for the pipeline. Spark uses the defined value when it runs the pipeline.
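For reference, the Spark equivalent of that pipeline-level configuration is setting the property on the Spark session. The following PySpark sketch uses an example value of 24; the value and application name are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Setting spark.sql.shuffle.partitions when the session is created mirrors
    # adding the property as an extra Spark configuration for a pipeline.
    spark = (SparkSession.builder
        .appName("shuffle-partitions-demo")
        .config("spark.sql.shuffle.partitions", "24")
        .getOrCreate())

    df = spark.range(0, 100_000).withColumn("key", F.col("id") % 10)
    shuffled = df.groupBy("key").count()
    # Shuffles now target 24 partitions instead of the default 200. Adaptive
    # query execution may coalesce small shuffles into even fewer partitions.
    print(shuffled.rdd.getNumPartitions())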

Repartitioning

When you need to change the partitioning, use the Repartition processor in the pipeline.

The Repartition processor changes how pipeline data is partitioned. The processor redistributes data across partitions, increasing or decreasing the number of partitions as needed. The processor can randomly redistribute the data across the partitions or can redistribute the data by specified fields.
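The processor's behavior corresponds to Spark's repartition operation. The following PySpark sketch uses hypothetical data and field names to show both random redistribution and redistribution by a field:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("repartition-demo").getOrCreate()
    df = spark.range(0, 1_000_000).withColumn("region", F.col("id") % 4)

    # Random (round-robin) redistribution into a fixed number of partitions.
    evenly_spread = df.repartition(16)
    print(evenly_spread.rdd.getNumPartitions())   # 16

    # Redistribution by field: rows with the same region land in the same partition.
    by_region = df.repartition(8, "region")
    print(by_region.rdd.getNumPartitions())       # 8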

You might use the Repartition processor to repartition pipeline data for the following reasons:
  • Redistribute data that has become skewed, or unevenly distributed across partitions.
  • Increase the number of partitions when the pipeline fails due to an out-of-memory error.
  • Change the number of partitions that are written to file systems, as shown in the sketch after this list.
  • Partition the data by field to improve the performance of downstream analytic queries.
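As an illustration of the last two reasons, the following PySpark sketch uses hypothetical paths and field names. It repartitions before a write to control the number of output files, and it partitions the written data by field to speed up downstream queries:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("repartition-write-demo").getOrCreate()
    events = spark.range(0, 1_000_000).withColumn("country", F.col("id") % 3)

    # Each non-empty partition becomes one output file, so repartitioning to 4
    # partitions writes roughly 4 files.
    events.repartition(4).write.mode("overwrite").parquet("/data/out/events")

    # Repartitioning by the field before writing with partitionBy groups each
    # key's rows together, so queries that filter on country read less data.
    (events.repartition("country")
        .write.mode("overwrite")
        .partitionBy("country")
        .parquet("/data/out/events_by_country"))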