Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
Spark automatically handles the partitioning of pipeline data for you. However, at times you might need to change the size and number of partitions.
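As a point of reference, the following minimal Scala sketch uses plain Spark (outside of Transformer, runnable in spark-shell style) to show that a dataset is split into partitions and that each partition is processed by its own task. The dataset and session settings are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Minimal plain-Spark sketch (not Transformer-specific): data is split into
// partitions, and each partition is handed to a separate task.
val spark = SparkSession.builder()
  .appName("partitioning-basics")
  .master("local[4]")            // illustrative local run with 4 cores
  .getOrCreate()

val df = spark.range(0, 1000)    // a small example dataset

// How many partitions Spark chose for this data
println(df.rdd.getNumPartitions)

// Count the rows that each partition (and therefore each task) processes
val rowsPerPartition = df.rdd.mapPartitions(iter => Iterator(iter.size)).collect()
println(rowsPerPartition.mkString(", "))
```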
Initial Partitions
When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions to process the output stream from the origin, unless a processor causes Spark to shuffle the data.
How Spark initially splits pipeline data into partitions depends on the origin type. For example, for file-based origins, the number of partitions is determined by the number of files being read and the file type. For the JDBC Table origin, the number of partitions is defined by a stage property.
For information about how an origin creates partitions, see "Partitioning" in the origin documentation.
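For comparison, the same idea looks like this in plain Spark (not Transformer stage configuration), assuming a spark-shell session where `spark` is predefined: a file read derives its partitions from the input files, while a JDBC read takes an explicit partition count. The path, connection URL, table, and column names below are hypothetical.

```scala
// File-based read: partitions follow from the number, size, and type of the input files.
val fromFiles = spark.read.json("/data/events/*.json")
println(fromFiles.rdd.getNumPartitions)

// JDBC read: the partition count is set explicitly, similar in spirit to the
// JDBC Table origin's partition-related stage property.
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")   // hypothetical connection
  .option("dbtable", "orders")
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
println(fromJdbc.rdd.getNumPartitions)                     // 8
```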
Shuffling
Some processors cause Spark to shuffle the data, redistributing the data so that it is grouped differently across partitions. When shuffling data, Spark usually increases the number of partitions. The following processors can cause Spark to shuffle data:
- Aggregate processor
- Join processor
- Rank processor
- Repartition processor
Shuffling can be an expensive operation, so for optimal performance, design your pipelines to use as few processors that cause shuffling as possible.
When shuffling data, Spark creates the minimum number of partitions defined in the spark.sql.shuffle.partitions property. By default, the property is set to 200 partitions. As a result, Spark usually creates empty partitions when shuffling data. When the pipeline writes to a File destination, Spark uses one of these empty partitions to write an extra file that contains no data.
You can override the default value of the spark.sql.shuffle.partitions property for a specific pipeline by adding the property as an extra Spark configuration for the pipeline. Spark uses the defined value when it runs the pipeline.
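For illustration, here is a standalone Spark snippet (not Transformer configuration) showing the effect of setting the property; the values and data are examples only.

```scala
import org.apache.spark.sql.SparkSession

// Plain-Spark sketch: overriding spark.sql.shuffle.partitions, analogous to
// adding it as an extra Spark configuration for a pipeline.
val spark = SparkSession.builder()
  .appName("shuffle-partitions-demo")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "20")   // override the default of 200
  .config("spark.sql.adaptive.enabled", "false")  // keep the shuffle partition count fixed for this demo
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

// groupBy triggers a shuffle; the result uses the configured number of partitions
val aggregated = df.groupBy("key").sum("value")
println(aggregated.rdd.getNumPartitions)          // 20 instead of the default 200
```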
Repartitioning
When you need to change the partitioning, use the Repartition processor in the pipeline.
The Repartition processor changes how pipeline data is partitioned. The processor redistributes data across partitions, increasing or decreasing the number of partitions as needed. It can redistribute the data randomly across partitions or by specified fields.
Use the Repartition processor to perform tasks such as the following, illustrated in the sketch after this list:
- Redistribute data that has become skewed, or unevenly distributed across partitions.
- Increase the number of partitions when the pipeline fails due to an out of memory error.
- Change the number of partitions that are written to file systems.
- Partition the data by field to improve the performance of downstream analytic queries.
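The following is a minimal plain-Spark sketch of what repartitioning does under the hood; the DataFrame, partition counts, and column names are illustrative and are not Transformer stage properties.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Plain-Spark sketch of repartitioning; data and names are examples only.
val spark = SparkSession.builder().appName("repartition-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "us"), (2, "de"), (3, "us"), (4, "fr")).toDF("customer_id", "country")

// Randomly redistribute into a fixed number of partitions,
// e.g. to fix skew or relieve memory pressure in large tasks.
val evenlySpread = df.repartition(8)
println(evenlySpread.rdd.getNumPartitions)   // 8

// Redistribute by field so rows with the same value land in the same partition;
// this also shapes the files written to a file system and can help downstream
// queries that filter or group by that field.
val byCountry = df.repartition(8, col("country"))
println(byCountry.rdd.getNumPartitions)      // 8
```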