Rank

The Rank processor performs rank calculations for every input record based on a group of records. The processor performs calculations within a single batch; it does not calculate across multiple batches.

To group the records, you define the field to partition the data by. The Rank processor redistributes the input data by the specified field, placing records with the same value for the specified field in the same partition. To order the records within each partition, you define the field to order the data by. The processor orders the records in each partition, and then calculates the rank for each record.

For example, let's say that you want to rank employee salaries within each department. You configure the Rank processor to partition the data by the department field and then to order the data by the salary field. The processor ranks the salaries within the Sales department and separately ranks the salaries within the Marketing department.

The Rank processor passes all input fields to the output record and adds one output field to the record for each rank calculation.

When you configure the processor, you define the rank functions to calculate and the output field to use for each calculated value. The processor can perform multiple rank calculations. You also specify the field in the record to partition the data by and the field in the record to order the data by.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.

Rank Processing

The Rank processor performs one or more rank calculations for every input record based on a group of records.

For each calculation, you specify the rank function and an output field for the results. To group the records, you define the field to partition the data by. Then, you define the field to order the data by within each partition.

After performing the rank calculations, the processor passes all input fields and the output fields with the ranked values to the generated records.

Example

You want to rank cities by population size within each state. To do this, you configure the processor as follows:
  • Use the Rank function and output the results to a PopulationRank field.
  • Set the processor to partition by the State field.
  • Set the processor to order by the Population field in descending order.
Let's say a batch contains the following data:
City          Population   State
Davis         70220        CA
Westminster   18590        MD
Rockville     61209        MD
Santa Rosa    176439       CA
Manchester    4808         MD
The Rank processor splits the data into two partitions, placing all records where State is CA in one partition and all records where State is MD in the other. The processor orders the records by population size, in descending order, within each partition, and then calculates the rank for each record. The processor produces five output records, writing the results of the rank calculation to the PopulationRank output field in each record as follows:
City          Population   State   PopulationRank
Santa Rosa    176439       CA      1
Davis         70220        CA      2
Rockville     61209        MD      1
Westminster   18590        MD      2
Manchester    4808         MD      3
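
The example above can be reproduced with a minimal Python sketch. This is an illustration of the partition, order, and rank steps, not the processor's implementation. The function name rank_within_partitions and its parameters are hypothetical.

```python
from collections import defaultdict


def rank_within_partitions(records, partition_by, order_by, output_field,
                           descending=True):
    """Return copies of the records with a rank added per partition.

    Hypothetical helper for illustration only. Ties receive the same rank
    and leave gaps, as described for the Rank function.
    """
    # Group records that share the same partition-by value.
    partitions = defaultdict(list)
    for record in records:
        partitions[record[partition_by]].append(record)

    ranked = []
    for key in sorted(partitions):
        # Order the records within the partition.
        rows = sorted(partitions[key], key=lambda r: r[order_by],
                      reverse=descending)
        previous_value = object()  # sentinel that never equals a real value
        current_rank = 0
        for position, row in enumerate(rows, start=1):
            if row[order_by] != previous_value:
                current_rank = position
                previous_value = row[order_by]
            # Keep all input fields and add the rank output field.
            ranked.append({**row, output_field: current_rank})
    return ranked


batch = [
    {"City": "Davis", "Population": 70220, "State": "CA"},
    {"City": "Westminster", "Population": 18590, "State": "MD"},
    {"City": "Rockville", "Population": 61209, "State": "MD"},
    {"City": "Santa Rosa", "Population": 176439, "State": "CA"},
    {"City": "Manchester", "Population": 4808, "State": "MD"},
]

ranked_batch = rank_within_partitions(
    batch, partition_by="State", order_by="Population",
    output_field="PopulationRank")

for row in ranked_batch:
    print(row["City"], row["Population"], row["State"], row["PopulationRank"])
```

Under these assumptions, the printed ranks match the PopulationRank values in the output table above.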

Rank Functions

Rank functions return a ranking value for each record in a partition. You can use the following functions with the Rank processor:

Rank
Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence.
For example, when the first three records have identical values, the Rank function returns 1 as the rank value and then skips to 4 for the rank value of the fourth record, as follows:
TotalSales   Rank
5000         1
5000         1
5000         1
7000         4
Dense Rank
Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence.
For example, when the first three records have identical values, the Dense Rank function returns 1 as the rank value and then returns 2 for the rank value of the fourth record, as follows:
TotalSales   DenseRank
5000         1
5000         1
5000         1
7000         2
Percent Rank
Returns the percentage ranking of a value in a group of values.
Ntile
Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one.
Row Number
Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition.
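
The differences between these functions, especially how they treat identical values, can be illustrated with a short Python sketch that computes all five values for one ordered partition. This is not the processor's implementation. The function name rank_functions is hypothetical. The Percent Rank formula, (rank - 1) / (records - 1), and the Ntile rule that gives the extra records to the earliest buckets follow common SQL window function conventions and are assumptions here.

```python
def rank_functions(sorted_values, ntile_buckets=2):
    """Compute five rank values for one partition that is already ordered.

    Hypothetical helper for illustration only.
    """
    n = len(sorted_values)
    # Assumed Ntile rule: when the records do not divide evenly,
    # the first `extra` buckets each receive one extra record.
    base, extra = divmod(n, ntile_buckets)
    threshold = extra * (base + 1)  # records held by the larger buckets

    results = []
    rank = dense_rank = 0
    previous = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = row_number  # Rank jumps to the position, leaving gaps
            dense_rank += 1    # Dense Rank moves to the next integer
            previous = value
        # Assumed Percent Rank formula: (rank - 1) / (records - 1).
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        index = row_number - 1
        if index < threshold:
            ntile = index // (base + 1) + 1
        else:
            ntile = extra + (index - threshold) // base + 1
        results.append({
            "value": value,
            "rank": rank,
            "dense_rank": dense_rank,
            "percent_rank": percent_rank,
            "ntile": ntile,
            "row_number": row_number,
        })
    return results


# The TotalSales values used in the Rank and Dense Rank examples.
results = rank_functions([5000, 5000, 5000, 7000], ntile_buckets=2)
for row in results:
    print(row)
```

For the TotalSales values, the rank and dense_rank columns reproduce the two tables above, while row_number stays unique even for the three identical values.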

Partition and Order By Fields

When you configure the Rank processor, you must define the following fields:

Partition By Field
The partition by field determines how the Rank processor splits the input data into partitions. The Rank processor redistributes the data by the specified field, placing records with the same value for the specified field in the same partition. The processor can partition by one or more fields.
For example, let's say that you want to rank employee salaries by department. You configure the Rank processor to partition the data by the department field. The department field contains five possible values, so the processor creates five partitions. It then redistributes the input data by the department field, placing records with the same value for the department field in the same partition.
Order By Field
The order by field determines how the Rank processor orders the records within each partition. The processor can order the data in ascending or descending order. For example, to rank employee salaries by department, you configure the processor to order each department partition by the salary field in descending order.
The processor can order by one or more fields. When you order by multiple fields, the processor orders records according to the order of the listed fields on the Rank tab.
For example, let's say a batch contains the following data:
Name            Grade   Age   TestScore
Emily Bedford   5       11    95
Connor Chu      2       9     80
Miguel Garcia   2       8     100
Anna Garcia     2       9     95
You configure the processor to partition by the grade field, and then add the following order by fields in this order, with each field set to descending order:
  • Age
  • TestScore
The processor splits the data into two partitions, one for grade 5 and one for grade 2. The processor orders the records in each partition first by age and then by test score, and then ranks each record. The processor writes the results of the rank calculation to the Rank output field in each record as follows:
Name            Grade   Age   TestScore   Rank
Emily Bedford   5       11    95          1
Anna Garcia     2       9     95          1
Connor Chu      2       9     80          2
Miguel Garcia   2       8     100         3
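Note how the processor first orders the records in the grade 2 partition by age, and then by test score. Anna Garcia and Connor Chu are the same age, so their test scores determine their order. Miguel Garcia has the highest test score but is the youngest, so he ranks last.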
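
Ordering by multiple fields can be sketched in Python as follows. This is an illustration under stated assumptions, not the processor's implementation. The function name rank_by_fields is hypothetical, and the sketch assumes that records tie only when every order by field matches.

```python
from operator import itemgetter


def rank_by_fields(records, partition_by, order_by, output_field="Rank"):
    """Rank records within each partition using several order by fields.

    Hypothetical helper for illustration only. `order_by` is a list of
    (field name, descending) pairs, listed from highest to lowest priority.
    """
    # Group records by the partition-by value, in order of first appearance.
    partitions = {}
    for record in records:
        partitions.setdefault(record[partition_by], []).append(record)

    ranked = []
    for rows in partitions.values():
        # Python sorts are stable, so sorting by the lowest-priority field
        # first and the highest-priority field last yields the listed order.
        for field, descending in reversed(order_by):
            rows.sort(key=itemgetter(field), reverse=descending)

        previous_key, current_rank = None, 0
        for position, row in enumerate(rows, start=1):
            sort_key = tuple(row[field] for field, _ in order_by)
            if sort_key != previous_key:
                current_rank, previous_key = position, sort_key
            ranked.append({**row, output_field: current_rank})
    return ranked


students = [
    {"Name": "Emily Bedford", "Grade": 5, "Age": 11, "TestScore": 95},
    {"Name": "Connor Chu", "Grade": 2, "Age": 9, "TestScore": 80},
    {"Name": "Miguel Garcia", "Grade": 2, "Age": 8, "TestScore": 100},
    {"Name": "Anna Garcia", "Grade": 2, "Age": 9, "TestScore": 95},
]

ranked_students = rank_by_fields(
    students, partition_by="Grade",
    order_by=[("Age", True), ("TestScore", True)])

for row in ranked_students:
    print(row["Name"], row["Grade"], row["Age"], row["TestScore"], row["Rank"])
```

Swapping the two entries in order_by would rank Miguel Garcia first in the grade 2 partition, which shows why the order of the listed fields matters.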

Configuring a Rank Processor

Configure a Rank processor to perform rank calculations for every input record based on a group of records.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property   Description
    Name               Stage name.
    Description        Optional description.
    Cache Data         Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Rank tab, configure the following properties:
    Rank Property         Description
    Rank Calculations     Rank calculations to perform. Configure the following properties:
    • Rank Function - Rank function to use in the calculation. Use one of the following functions:
      • Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence.
      • Dense Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence.
      • Percent Rank - Returns the percentage ranking of a value in a group of values.
      • Ntile - Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one.
      • Row Number - Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition.
    • Output Field - Field for the results of the calculation.
    • Ntile Buckets - Number of buckets to divide the data into. Available for the Ntile function.

    Click the Add icon to add additional calculations.

    Partition By Fields   Fields to partition by. The processor redistributes the data so that records with the same values for the specified fields are in the same partition.

    Click the Add icon to specify another field to partition by.

    Order By Fields       Fields to order by within each partition. Configure the following properties:
    • Field Name - Name of the field to order by.
    • Direction - Direction to order the data, either ascending or descending.

    Click the Add icon to specify another field to order by. When you order by multiple fields, the processor orders records according to the order of the listed fields on the Rank tab.