Rank
The Rank processor performs rank calculations for every input record based on a group of records. The processor performs calculations within a single batch; it does not calculate across multiple batches.
To group the records, you define the field to partition the data by. The Rank processor redistributes the input data by the specified field, placing records with the same value for the specified field in the same partition. To order the records within each partition, you define the field to order the data by. The processor orders the records in each partition, and then calculates the rank for each record.
For example, let's say that you want to rank employee salaries within each department. You configure the Rank processor to partition the data by the department field and then to order the data by the salary field. The processor ranks the salaries within the Sales department and separately ranks the salaries within the Marketing department.
The Rank processor passes all input fields to the output record and adds one output field to the record for each rank calculation.
When you configure the processor, you define the rank functions to calculate and the output field to use for each calculated value. The processor can perform multiple rank calculations. You also specify the field in the record to partition the data by and the field in the record to order the data by.
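The single-batch scope can be illustrated with a minimal Python sketch. This is not the processor's implementation; the `rank_batch` helper, the `dept` and `salary` field names, and the `rank` output field are all hypothetical, chosen only to show that ranks restart with every batch.

```python
from collections import defaultdict


def rank_batch(records, partition_by, order_by):
    """Rank the records of ONE batch in descending order of `order_by`.

    Hypothetical helper for illustration only. Each call sees a single
    batch, so ranks never carry over from one batch to the next.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[record[partition_by]].append(record)

    output = []
    for rows in partitions.values():
        for row in sorted(rows, key=lambda r: r[order_by], reverse=True):
            # Rank = 1 + number of rows in the partition with a higher value.
            rank = 1 + sum(other[order_by] > row[order_by] for other in rows)
            # Pass all input fields through and add one output field.
            output.append({**row, "rank": rank})
    return output


batch_1 = [
    {"dept": "Sales", "salary": 50000},
    {"dept": "Sales", "salary": 70000},
]
batch_2 = [{"dept": "Sales", "salary": 90000}]

ranked_1 = rank_batch(batch_1, "dept", "salary")
ranked_2 = rank_batch(batch_2, "dept", "salary")
```

In this sketch, the 70000 salary in the first batch and the 90000 salary in the second batch both receive rank 1, because each batch is ranked independently.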
Rank Processing
The Rank processor performs one or more rank calculations for every input record based on a group of records.
For each calculation, you specify the rank function and an output field for the results. To group the records, you define the field to partition the data by. Then, you define the field to order the data by within each partition.
After performing the rank calculations, the processor passes all input fields and the output fields with the ranked values to the generated records.
Example
You want to rank cities by population size within each state. To do this, you configure the processor as follows:
- Use the Rank function and output the results to a PopulationRank field.
- Set the processor to partition by the State field.
- Set the processor to order by the Population field in descending order.
The processor receives the following data:
City | Population | State |
---|---|---|
Davis | 70220 | CA |
Westminster | 18590 | MD |
Rockville | 61209 | MD |
Santa Rosa | 176439 | CA |
Manchester | 4808 | MD |
The processor places all records where State is CA in one partition, and all records where State is MD in the next partition. The processor orders the records by population size within each partition, and then calculates the ranking value for each record. The processor produces five output records, writing the results of the rank calculation to the PopulationRank output field in each record as follows:
City | Population | State | PopulationRank |
---|---|---|---|
Santa Rosa | 176439 | CA | 1 |
Davis | 70220 | CA | 2 |
Rockville | 61209 | MD | 1 |
Westminster | 18590 | MD | 2 |
Manchester | 4808 | MD | 3 |
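The example above can be reproduced with a short Python sketch. It only mirrors the partition, order, and rank steps on plain dictionaries; it is not how the processor itself runs, and iterating over the partitions in sorted State order is an assumption made so that the output matches the table.

```python
from collections import defaultdict

cities = [
    {"City": "Davis", "Population": 70220, "State": "CA"},
    {"City": "Westminster", "Population": 18590, "State": "MD"},
    {"City": "Rockville", "Population": 61209, "State": "MD"},
    {"City": "Santa Rosa", "Population": 176439, "State": "CA"},
    {"City": "Manchester", "Population": 4808, "State": "MD"},
]

# Partition by State: records with the same State land in the same partition.
partitions = defaultdict(list)
for record in cities:
    partitions[record["State"]].append(record)

ranked = []
for state in sorted(partitions):  # assumption: fixed partition order for display
    # Order by Population, descending, within the partition.
    rows = sorted(partitions[state], key=lambda r: r["Population"], reverse=True)
    for row in rows:
        # Rank = 1 + number of cities in the partition with a larger population.
        larger = sum(other["Population"] > row["Population"] for other in rows)
        ranked.append({**row, "PopulationRank": larger + 1})

for row in ranked:
    print(row["City"], row["Population"], row["State"], row["PopulationRank"])
```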
Rank Functions
Rank functions return a ranking value for each record in a partition. You can use the following functions with the Rank processor:
- Rank
- Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence.
- Dense Rank
- Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence.
- Percent Rank
- Returns the percentage ranking of a value in a group of values.
- Ntile
- Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one.
- Row Number
- Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition.
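To compare the five functions on data that contains ties, the following Python sketch computes all of them for one already-ordered partition. It assumes the conventional SQL window-function definitions, which the text above does not spell out: percent rank as (rank - 1) / (rows - 1), and Ntile giving one extra record to the earliest buckets when the records do not divide evenly. The `rank_functions` helper and its output keys are hypothetical.

```python
def rank_functions(values, buckets):
    """Compute rank values for one partition.

    `values` holds the order-by values of a single partition, already
    sorted in the configured direction. Hypothetical helper that assumes
    SQL-style definitions for percent rank and Ntile.
    """
    n = len(values)
    base, extra = divmod(n, buckets)   # bucket size and number of larger buckets
    cutoff = extra * (base + 1)        # records covered by the larger buckets

    results = []
    rank = dense_rank = 0
    for index, value in enumerate(values):
        row_number = index + 1
        if index == 0 or value != values[index - 1]:
            rank = row_number          # skips ahead after ties, leaving gaps
            dense_rank += 1            # always advances by one, no gaps
        percent_rank = (rank - 1) / (n - 1) if n > 1 else 0.0
        if index < cutoff:
            ntile = index // (base + 1) + 1
        else:
            ntile = extra + (index - cutoff) // base + 1
        results.append({
            "value": value,
            "rank": rank,
            "dense_rank": dense_rank,
            "percent_rank": percent_rank,
            "ntile": ntile,
            "row_number": row_number,
        })
    return results


# One partition ordered in descending order, with a tie at 80.
rows = rank_functions([90, 80, 80, 70, 60], buckets=2)
for row in rows:
    print(row)
```

With the tie at 80, Rank yields 1, 2, 2, 4, 5 while Dense Rank yields 1, 2, 2, 3, 4, which shows the gap that Rank leaves after identical values.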
Partition and Order By Fields
When you configure the Rank processor, you must define the following fields:
- Partition By Field
- The partition by field determines how the Rank processor splits the input data into partitions. The Rank processor redistributes the data by the specified field, placing records with the same value for the specified field in the same partition. The processor can partition by one or more fields.
- Order By Field
- The order by field determines how the Rank processor orders the records within each partition. The processor can order the data in ascending or descending order. For example, to rank employee salaries by department, you configure the processor to order each department partition by the salary field in descending order.
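When more than one partition by field or order by field is involved, the grouping and ordering can be sketched in Python as follows. The `partition_and_order` helper, the field names, and the `"asc"`/`"desc"` direction strings are hypothetical; the sketch assumes that earlier order by fields take precedence over later ones.

```python
from collections import defaultdict


def partition_and_order(records, partition_by, order_by):
    """Group records by several fields, then order each group.

    partition_by: list of field names.
    order_by: list of (field name, "asc" or "desc") pairs, most
    significant first. Hypothetical helper for illustration only.
    """
    partitions = defaultdict(list)
    for record in records:
        # Records share a partition only if ALL partition fields match.
        key = tuple(record[field] for field in partition_by)
        partitions[key].append(record)

    for rows in partitions.values():
        # Python's sort is stable, so sorting by the least significant
        # field first lets the first listed field decide the final order.
        for field, direction in reversed(order_by):
            rows.sort(key=lambda r, f=field: r[f], reverse=(direction == "desc"))
    return dict(partitions)


employees = [
    {"name": "Ann", "dept": "Sales", "region": "East", "salary": 70000},
    {"name": "Cy", "dept": "Sales", "region": "East", "salary": 85000},
    {"name": "Bob", "dept": "Sales", "region": "East", "salary": 70000},
    {"name": "Dee", "dept": "Sales", "region": "West", "salary": 60000},
    {"name": "Eli", "dept": "Marketing", "region": "East", "salary": 65000},
]

result = partition_and_order(
    employees,
    partition_by=["dept", "region"],
    order_by=[("salary", "desc"), ("name", "asc")],
)
```

Here the Sales/East partition is ordered by salary first, and the name field only breaks the tie between the two 70000 salaries.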
Configuring a Rank Processor
Configure a Rank processor to perform rank calculations for every input record based on a group of records.
- In the Properties panel, on the General tab, configure the following properties:
General Property | Description |
---|---|
Name | Stage name. |
Description | Optional description. |
Cache Data | Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode. |
- On the Rank tab, configure the following properties:
  - Rank Calculations - Rank calculations to perform. Configure the following properties:
    - Rank Function - Rank function to use in the calculation. Use one of the following functions:
      - Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then increments the next rank by the number of identical values, producing gaps in the ranking sequence.
      - Dense Rank - Returns the rank of a value in a group of values. Returns the same rank when values are identical, and then returns the next rank for the following record, which does not produce gaps in the ranking sequence.
      - Percent Rank - Returns the percentage ranking of a value in a group of values.
      - Ntile - Evenly divides the records for each partition into the specified number of buckets. Each bucket is numbered, starting at one.
      - Row Number - Returns a unique, sequential number for each record, starting with one, according to the ordering of records within the partition.
    - Output Field - Field for the results of the calculation.
    - Ntile Buckets - Number of buckets to divide the data into. Available for the Ntile function.
    Click the Add icon to add additional calculations.
  - Partition By Fields - Fields to partition by. The processor redistributes the data so that records with the same values for the specified fields are in the same partition. Click the Add icon to specify another field to partition by.
  - Order By Fields - Fields to order by within each partition. Configure the following properties:
    - Field Name - Name of the field to order by.
    - Direction - Direction to order the data, either ascending or descending.
    Click the Add icon to specify another field to order by. When you order by multiple fields, the processor orders records according to the order of the listed fields on the Rank tab.