Sort

The Sort processor sorts incoming data based on one or more specified fields. The processor can sort data in ascending or descending order.

For example, let's say that you create a batch pipeline to read all available data in the orders table in a relational database, transform the data, and then write the data to a destination system. Before writing the data, you want the pipeline to sort all records by the order ID. To do this, you add a Sort processor before the destination, and configure the processor to sort by the order_id field in ascending order.
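The effect of this configuration can be pictured with a minimal Python sketch. It is only an illustration of sorting by a single field in ascending order, not how the processor is implemented; the records and every field other than order_id are hypothetical.

```python
# Hypothetical records standing in for rows read from the orders table.
# Only the order_id field comes from the example above; the rest is made up.
orders = [
    {"order_id": 1042, "customer": "Ruiz", "total": 18.50},
    {"order_id": 1007, "customer": "Chen", "total": 72.00},
    {"order_id": 1023, "customer": "Okafor", "total": 5.25},
]

# Sort by the order_id field in ascending order.
sorted_orders = sorted(orders, key=lambda record: record["order_id"])

for record in sorted_orders:
    print(record["order_id"], record["customer"])
```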

You can configure the Sort processor to sort by one or more fields.
Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.
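The following Python sketch suggests why larger batches can matter. It assumes that the data is sorted one batch at a time, which the tip implies but this page does not state explicitly; the batch contents are hypothetical.

```python
# Two small hypothetical batches arriving in a streaming pipeline.
batches = [[5, 2, 9], [4, 1, 8]]

# Assumption: each batch is sorted independently of the others.
sorted_per_batch = [sorted(batch) for batch in batches]

# If the same values arrive as one larger batch, they are sorted together.
one_large_batch = [value for batch in batches for value in batch]
sorted_large_batch = sorted(one_large_batch)

print(sorted_per_batch)
print(sorted_large_batch)
```

Under that assumption, each small batch is ordered internally, but the values are fully ordered relative to each other only when they are evaluated in the same batch.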

Sort by Multiple Fields

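When you sort by multiple fields, the Sort processor sorts data according to the order of the fields listed on the Sort tab. The processor sorts by the first listed field, then uses each subsequent field to order records that have the same values in the preceding fields.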

For example, let's say that your pipeline processes student data. You want to sort the students first by grade level, then by last name, and then by first name. You add a Sort processor to the pipeline. On the Sort tab of the processor, you add the following fields in this order, with each field set to ascending order:
  • grade
  • last_name
  • first_name

You preview the pipeline with sample data. Preview displays the following input and output data for the Sort processor, showing how the record order has changed:

Notice how grade 2 students are listed in the last three records in the input data, but the processor reorders them as the first three records in the output data. The output data also shows how the processor additionally sorts the grade 2 students alphabetically by last name and then by first name.
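The same ordering can be sketched in Python with made-up student records. The data below is hypothetical and is not the sample data from the preview; it only mirrors the described pattern, with the grade 2 students arriving last in the input.

```python
# Hypothetical input records; the grade 2 students are the last three.
students = [
    {"grade": 3, "last_name": "Adams", "first_name": "Noor"},
    {"grade": 4, "last_name": "Baker", "first_name": "Liam"},
    {"grade": 2, "last_name": "Young", "first_name": "Ava"},
    {"grade": 2, "last_name": "Diaz", "first_name": "Marco"},
    {"grade": 2, "last_name": "Diaz", "first_name": "Elena"},
]

# Fields in the same order as they are listed on the Sort tab.
sort_fields = ["grade", "last_name", "first_name"]


def sort_key(record):
    # Tuples compare element by element, so later fields only break ties
    # among records that match on all earlier fields.
    return tuple(record[field] for field in sort_fields)


sorted_students = sorted(students, key=sort_key)

for s in sorted_students:
    print(s["grade"], s["last_name"], s["first_name"])
```

In this sketch the grade 2 students move to the first three positions, and the two students named Diaz are ordered by first name.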

Configuring a Sort Processor

Configure a Sort processor to sort incoming data based on specified fields.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property | Description
    Name | Stage name.
    Description | Optional description.
    Cache Data | Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Sort tab, configure the following properties for the field that you want to sort by:
    Sort Property | Description
    Field | Name of the field in the input data to sort by.
    Order | Order to sort the data: Ascending or Descending.
  3. To sort by additional fields, click the Add icon to specify another field name and sort order.
    You can use simple or bulk edit mode to configure the fields.
    When configured to sort by multiple fields, the processor sorts data according to the order of the listed fields on the Sort tab.
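Because each field has its own Order setting, different fields can be sorted in different directions. The Python sketch below shows one way to picture that behavior. The sort_records helper, the (field, order) list, and the sample records are all hypothetical; they are not the processor's configuration format or its implementation.

```python
from operator import itemgetter


def sort_records(records, sort_spec):
    """Sort records by a list of (field, order) pairs.

    order is "ascending" or "descending". Hypothetical helper for illustration.
    """
    for _, order in sort_spec:
        if order not in ("ascending", "descending"):
            raise ValueError(f"unknown sort order: {order}")

    result = list(records)
    # Apply the fields from last to first. Python's sort is stable, so the
    # ordering from a later field survives as the tie-breaker for earlier ones.
    for field, order in reversed(sort_spec):
        result.sort(key=itemgetter(field), reverse=(order == "descending"))
    return result


# Hypothetical records and sort settings.
records = [
    {"grade": 2, "last_name": "Young"},
    {"grade": 3, "last_name": "Adams"},
    {"grade": 2, "last_name": "Diaz"},
    {"grade": 3, "last_name": "Cole"},
]
sort_spec = [("grade", "descending"), ("last_name", "ascending")]

result = sort_records(records, sort_spec)
for r in result:
    print(r["grade"], r["last_name"])
```

Here the records are ordered by grade from highest to lowest, and records with the same grade are ordered alphabetically by last name.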