Rank

The Rank processor performs rank calculations for every input record based on a group of records. The processor performs calculations within a single batch; it does not calculate across multiple batches.

To group the records, you define the field to partition the data by. The Rank processor redistributes the input data by the specified field, placing records with the same value for the specified field in the same partition. To order the records within each partition, you define the field to order the data by. The processor orders the records in each partition, and then calculates the rank for each record.

For example, let's say that you want to rank employee salaries within each department. You configure the Rank processor to partition the data by the department field and then to order the data by the salary field. The processor ranks the salaries within the Sales department and separately ranks the salaries within the Marketing department.

The Rank processor passes all input fields to the output record and adds one output field to the record for each rank calculation.

When you configure the processor, you define the rank functions to calculate and the output field to use for each calculated value. The processor can perform multiple rank calculations. You also specify the field in the record to partition the data by and the field in the record to order the data by.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.

Rank Processing

The Rank processor performs one or more rank calculations for every input record based on a group of records.

For each calculation, you specify the rank function and an output field for the results. To group the records, you define the field to partition the data by. You then define the field to order the data by within each partition.

After performing the rank calculations, the processor writes all input fields, plus the output fields that contain the rank values, to the output records.

Example

You want to rank cities by population size within each state. To do this, you configure the processor as follows:

Use the Rank function and output the results to a PopulationRank field.
Set the processor to partition by the State field.
Set the processor to order by the Population field in descending order.

Let's say a batch contains the following data:


City	Population	State
Davis	70220	CA
Westminster	18590	MD
Rockville	61209	MD
Santa Rosa	176439	CA
Manchester	4808	MD

The Rank processor splits the data into two partitions, placing all records where State is CA in one partition and all records where State is MD in the other. The processor orders the records by population size within each partition, and then calculates the ranking value for each record. The processor produces five output records, writing the results of the rank calculation to the PopulationRank output field in each record as follows:


City	Population	State	PopulationRank
Santa Rosa	176439	CA	1
Davis	70220	CA	2
Rockville	61209	MD	1
Westminster	18590	MD	2
Manchester	4808	MD	3
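The partition, order, and rank steps in this example can be sketched in plain Python. This is only an illustration of the logic described above, not the processor's implementation; the function name rank_within_partitions and its parameters are hypothetical.

```python
from collections import defaultdict

records = [
    {"City": "Davis", "Population": 70220, "State": "CA"},
    {"City": "Westminster", "Population": 18590, "State": "MD"},
    {"City": "Rockville", "Population": 61209, "State": "MD"},
    {"City": "Santa Rosa", "Population": 176439, "State": "CA"},
    {"City": "Manchester", "Population": 4808, "State": "MD"},
]


def rank_within_partitions(rows, partition_by, order_by, output_field, descending=True):
    """Hypothetical helper: group rows, sort each group, and add a rank field."""
    # Step 1: place records with the same partition value in the same group.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_by]].append(row)

    ranked = []
    for key in sorted(partitions):
        # Step 2: order the records within the partition.
        ordered = sorted(partitions[key], key=lambda r: r[order_by], reverse=descending)
        # Step 3: assign ranks; identical values share a rank.
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for position, row in enumerate(ordered, start=1):
            if row[order_by] != previous:
                rank = position
                previous = row[order_by]
            # All input fields are kept; the rank is added as a new field.
            ranked.append({**row, output_field: rank})
    return ranked


result = rank_within_partitions(records, "State", "Population", "PopulationRank")
for row in result:
    print(row["City"], row["Population"], row["State"], row["PopulationRank"])
```

The printed rows match the output table above: each state is ranked independently, and every input field is carried through to the output record.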

Rank Functions

Rank functions return a ranking value for each record in a partition. You can use the following functions with the Rank processor:

Rank

Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence.

For example, when the first three records have identical values, the Rank function returns 1 as the rank value and then skips to 4 for the rank value of the fourth record, as follows:


TotalSales	Rank
5000	1
5000	1
5000	1
7000	4
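As a minimal sketch of this behavior, the following Python function assigns ranks to values that are already in order. The function name rank is illustrative only and is not part of the processor.

```python
def rank(values):
    """Return ranks with gaps for a list of values that is already ordered."""
    ranks = []
    for position, value in enumerate(values, start=1):
        if position > 1 and value == values[position - 2]:
            ranks.append(ranks[-1])  # tie: reuse the rank of the previous record
        else:
            ranks.append(position)   # new value: rank equals the record's position
    return ranks


total_sales = [5000, 5000, 5000, 7000]
print(rank(total_sales))  # [1, 1, 1, 4]
```

Because a new value takes its position as its rank, the three tied records push the fourth record to rank 4.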

Dense Rank

Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence.

For example, when the first three records have identical values, the Dense Rank function returns 1 as the rank value and then returns 2 for the rank value of the fourth record, as follows:


TotalSales	DenseRank
5000	1
5000	1
5000	1
7000	2
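The same data can be run through a small Python sketch of dense ranking. The function name dense_rank is illustrative and assumes the values are already in order.

```python
def dense_rank(values):
    """Return ranks without gaps for a list of values that is already ordered."""
    ranks = []
    for index, value in enumerate(values):
        if index > 0 and value == values[index - 1]:
            ranks.append(ranks[-1])          # tie: reuse the previous rank
        elif ranks:
            ranks.append(ranks[-1] + 1)      # new value: next consecutive rank
        else:
            ranks.append(1)                  # first record always ranks 1
    return ranks


total_sales = [5000, 5000, 5000, 7000]
print(dense_rank(total_sales))  # [1, 1, 1, 2]
```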

Percent Rank

Returns the percentage ranking of a value in a group of values.
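The description above does not give a formula. The sketch below assumes the definition commonly used for SQL window functions, (rank - 1) / (number of records in the partition - 1), where rank is the gapped rank. Treat that formula, the function name percent_rank, and the sample values as assumptions for illustration rather than confirmed processor behavior.

```python
def percent_rank(values):
    """Assumed formula: (rank - 1) / (n - 1), for values that are already ordered."""
    n = len(values)
    if n <= 1:
        # Assumption: a partition with a single record gets 0.0.
        return [0.0] * n

    # Compute the gapped rank first.
    ranks = []
    for position, value in enumerate(values, start=1):
        if position > 1 and value == values[position - 2]:
            ranks.append(ranks[-1])
        else:
            ranks.append(position)

    return [(r - 1) / (n - 1) for r in ranks]


# Illustrative values only, not taken from the documentation.
print(percent_rank([100, 200, 200, 300, 400]))  # [0.0, 0.25, 0.25, 0.75, 1.0]
```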

Ntile

Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one.
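As a rough sketch, bucket assignment for an ordered partition might look like the following Python function. The text above does not say what happens when the records do not divide evenly. This sketch assumes the usual SQL convention that the earlier buckets each receive one extra record. The function name ntile and that convention are assumptions.

```python
def ntile(record_count, buckets):
    """Return a bucket number (starting at 1) for each ordered record."""
    base, extra = divmod(record_count, buckets)
    assignments = []
    for bucket in range(1, buckets + 1):
        # Assumption: the first `extra` buckets take one additional record.
        size = base + (1 if bucket <= extra else 0)
        assignments.extend([bucket] * size)
    return assignments


print(ntile(4, 2))  # [1, 1, 2, 2]
print(ntile(7, 3))  # [1, 1, 1, 2, 2, 3, 3]
```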

Row Number

Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition.
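To contrast this with the Rank and Dense Rank functions, the short Python sketch below numbers the same sales values used in the earlier examples. How tied records are ordered relative to each other is not stated above; this sketch simply keeps their input order.

```python
total_sales = [7000, 5000, 5000, 5000]

# Order the records, then number them sequentially starting at one.
ordered = sorted(total_sales)
row_numbers = [(value, number) for number, value in enumerate(ordered, start=1)]

print(row_numbers)  # [(5000, 1), (5000, 2), (5000, 3), (7000, 4)]
```

Unlike Rank and Dense Rank, the three identical values each receive a different number.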

Partition and Order By Fields

When you configure the Rank processor, you must define the following fields:

Partition By Field

The partition by field determines how the Rank processor splits the input data into partitions. The Rank processor redistributes the data by the specified field, placing records with the same value for the specified field in the same partition. The processor can partition by one or more fields.

For example, let's say that you want to rank employee salaries by department. You configure the Rank processor to partition the data by the department field. The department field contains five possible values, so the processor creates five partitions. It then redistributes the input data by the department field, placing records with the same value for the department field in the same partition.

Order By Field

The order by field determines how the Rank processor orders the records within each partition. The processor can order the data in ascending or descending order. For example, to rank employee salaries by department, you configure the processor to order each department partition by the salary field in descending order.

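The processor can order by one or more fields. When you order by multiple fields, the processor applies them in the order that they are listed on the Rank tab: it orders records by the first field, and uses each subsequent field to order records that have the same value for the preceding fields.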

For example, let's say a batch contains the following data:


Name	Grade	Age	TestScore
Emily Bedford	5	11	95
Connor Chu	2	9	80
Miguel Garcia	2	8	100
Anna Garcia	2	9	95

You configure the processor to partition by the Grade field, and then add the following order by fields in this order, with each field set to descending order:

Age
TestScore

The processor splits the data into two partitions, one for grade 5 and one for grade 2. The processor orders the records in each partition first by age and then by test score, and then ranks each record. The processor writes the results of the rank calculation to the Rank output field in each record as follows:


Name	Grade	Age	TestScore	Rank
Emily Bedford	5	11	95	1
Anna Garcia	2	9	95	1
Connor Chu	2	9	80	2
Miguel Garcia	2	8	100	3

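Note how the processor first orders the records in the grade 2 partition by age, and then by test score. Anna Garcia and Connor Chu are both 9, so the test score determines their order. Miguel Garcia has the highest test score but ranks last because he is the youngest.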
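Ordering by multiple fields can be sketched in Python by sorting on a tuple of the order by fields. This is an illustration of the logic in the example, not the processor's implementation; the sort_key helper is hypothetical.

```python
from itertools import groupby

students = [
    {"Name": "Emily Bedford", "Grade": 5, "Age": 11, "TestScore": 95},
    {"Name": "Connor Chu", "Grade": 2, "Age": 9, "TestScore": 80},
    {"Name": "Miguel Garcia", "Grade": 2, "Age": 8, "TestScore": 100},
    {"Name": "Anna Garcia", "Grade": 2, "Age": 9, "TestScore": 95},
]


def sort_key(record):
    """Order by Age first, then by TestScore, matching the listed field order."""
    return (record["Age"], record["TestScore"])


ranked = []
# Partition by Grade. groupby needs its input sorted by the grouping key.
by_grade = sorted(students, key=lambda r: r["Grade"], reverse=True)
for grade, group in groupby(by_grade, key=lambda r: r["Grade"]):
    # Both order by fields are descending, so reverse the tuple sort.
    ordered = sorted(group, key=sort_key, reverse=True)
    for position, record in enumerate(ordered, start=1):
        # Records tie only when every order by field matches.
        tied = position > 1 and sort_key(record) == sort_key(ordered[position - 2])
        rank = ranked[-1]["Rank"] if tied else position
        ranked.append({**record, "Rank": rank})

for record in ranked:
    print(record["Name"], record["Grade"], record["Age"], record["TestScore"], record["Rank"])
```

The printed rows match the output table above. Swapping the two fields in sort_key would rank Miguel Garcia first in grade 2, which shows why the listed order of the fields matters.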

Configuring a Rank Processor

Configure a Rank processor to perform rank calculations for every input record based on a group of records.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

On the Rank tab, configure the following properties:


Rank Property	Description
Rank Calculations	Rank calculations to perform. Configure the following properties: Rank Function - Rank function to use in the calculation. Use one of the following functions: Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence. Dense Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence. Percent Rank - Returns the percentage ranking of a value in a group of values. Ntile - Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one. Row Number - Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition. Output Field - Field for the results of the calculation. Ntile Buckets - Number of buckets to divide the data into. Available for the Ntile function. Click the Add icon to add additional calculations.
Partition By Fields	Fields to partition by. The processor redistributes the data so that records with the same values for the specified fields are in the same partition. Click the Add icon to specify another field to partition by.
Order By Fields	Fields to order by within each partition. Configure the following properties: Field Name - Name of the field to order by. Direction - Direction to order the data, either ascending or descending. Click the Add icon to specify another field to order by. When you order by multiple fields, the processor orders records according to the order of the listed fields on the Rank tab.