Repartition
The Repartition processor changes how pipeline data is partitioned. It redistributes data across partitions, increasing or decreasing the number of partitions as needed, and can redistribute the data either randomly or by specified fields.
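In Spark DataFrame terms, these two repartition methods correspond to the overloads of the repartition method. A minimal Scala sketch (the column name country, the sample data, and the partition count are illustrative, not from the product):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("repartition-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("us", 1), ("de", 2), ("us", 3)).toDF("country", "value")

// Random redistribution: rows are spread round-robin across 8 partitions.
val byCount = df.repartition(8)

// Redistribution by field: rows with the same country hash to the same partition.
val byField = df.repartition(col("country"))

// Both can be combined: 8 partitions, hash-partitioned on country.
val both = df.repartition(8, col("country"))
```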
Spark runs a Transformer pipeline just as it runs any other application: it splits the data into partitions and performs operations on the partitions in parallel. Spark handles the partitioning automatically, but at times you might need to control the size and number of partitions. When you do, use the Repartition processor in the pipeline.
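You can observe Spark's automatic partitioning, and the effect of a repartition, by checking a DataFrame's partition count. A short sketch, continuing the df from above:

```scala
// Spark picks an initial partition count based on the data source and cluster.
println(df.rdd.getNumPartitions)

// After repartitioning, the count reflects the value you specified.
println(df.repartition(8).rdd.getNumPartitions) // 8
```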
When you configure the Repartition processor, you select the repartition method to use and specify how to create partitions.
You can use multiple Repartition processors in a pipeline. As a best practice, however, design your pipeline to use as few as possible: each Repartition processor causes Spark to shuffle the data, redistributing it so that it's grouped differently across partitions, which can be an expensive operation.
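In raw Spark, the shuffle appears in the physical plan as an Exchange operator, which is why each repartition adds cost. A sketch of how you might confirm this (exact plan output varies by Spark version), along with coalesce, which in Spark can reduce the partition count by merging existing partitions without a full shuffle:

```scala
// The physical plan for a repartition contains an Exchange (shuffle) operator.
df.repartition(8).explain()
// == Physical Plan ==
// Exchange RoundRobinPartitioning(8), ...

// coalesce(n) lowers the partition count by merging existing partitions,
// avoiding a full shuffle. It cannot increase the partition count.
val fewer = df.coalesce(2)
```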