Aggregate
The Aggregate processor performs aggregate calculations on data within a batch. The processor does not perform calculations across multiple batches. To perform calculations on subsets of a batch, you can specify fields to group by.
For example, you can use the Aggregate processor to calculate the sum of all purchases in
a batch grouped by state, and to write the results to a State_Total
output field in each record.
Records generated by the Aggregate processor include the output fields and the fields to group by. All other fields from the input record are dropped.
When you configure the Aggregate processor to perform a calculation, you specify the aggregation field, aggregate function, and output field to use. You can specify multiple sets of calculations to perform. You can optionally specify fields to group by for the calculations.
Aggregate Processing
The Aggregate processor can perform one or more aggregate calculations on data in a batch.
For each calculation that you want to perform, you specify the aggregate function, the field to use for the calculation, and an output field for the results.
By default, the processor performs calculations using all of the data in the batch. To perform calculations on subsets of data, specify one or more fields to group by.
After performing the calculations, the processor passes the output fields and the fields to group by to the generated records. The processor drops all other fields from the input records.
Example
- Use the
COUNT
function and output the results to aStateCount
field. - Use the
SUM
function with theTotal
field, and output the results to aStateTotal
field. - Set the processor to group by the
State
field.
Let's say a batch contains the following data:
TransactionID | StoreID | State | Total |
---|---|---|---|
0032355 | 35 | MD | 230.40 |
0016433 | 20 | IL | 90.50 |
0016434 | 20 | IL | 489.23 |
0032356 | 35 | MD | 63.50 |
0032357 | 35 | MD | 150.49 |
StateCount
and
StateTotal
fields in each record as follows:State | StateCount | StateTotal |
---|---|---|
MD | 3 | 444.39 |
IL | 2 | 579.73 |
Aggregate Functions
- Approximate Count Version - Returns the approximate number of distinct items in a group.
- Approximate Variance - Returns the variance of a values from a group. Approximate variance uses a sample of a group rather than the full population.
- Average - Returns the average of the values in a group.
- Count - Returns the number of items in a group.
- First - Returns the value of the first record in a group.
- Last - Returns the value of the last record in a group.
- Max - Returns the maximum value of a group.
- Median - Returns the median of the values of a group.
- Min - Returns the minimum value of a group.
- Sum - Returns the sum of the values of a group.
- Standard Deviation - Returns the standard deviation of a value from the group. Standard deviation uses a sample of a group rather than the full population.
Default Output Fields
When you configure the Aggregate processor, you can specify the output field to use. When you do not specify an output field, the Aggregate processor uses a default output field.
Aggregate Function | Default Output Field Name |
---|---|
Approximate Count Version | approx_count_distinct |
Approximate Variance | variance |
Average | avg |
Count | count |
First | first |
Last | last |
Max | max |
Median | median |
Min | min |
Sum | sum |
Standard Deviation | std_dev |
Configuring an Aggregate Processor
-
In the Properties panel, on the
General tab, configure the following properties:
General Property Description Name Stage name. Description Optional description. Cache Data Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
-
On the Aggregate tab, configure the following
properties:
Aggregate Property Description Aggregate Type and Field The aggregate calculations to perform. Configure the following properties: - Aggregate function - The aggregation function to use
in the calculation. Use one of the following
functions:
- Approximate Count Version - Returns the approximate number of distinct items in a group.
- Approximate Variance - Returns the variance of a values from a group. Approximate variance uses a sample of a group rather than the full population.
- Average - Returns the average of the values in a group.
- Count - Returns the number of items in a group.
- First - Returns the value of the first record in a group.
- Last - Returns the value of the last record in a group.
- Max - Returns the maximum value of a group.
- Median - Returns the median of the values of a group.
- Min - Returns the minimum value of a group.
- Sum - Returns the sum of the values of a group.
- Standard Deviation - Returns the standard deviation of a value from the group. Standard deviation uses a sample of a group rather than the full population.
- Aggregate Field - The field to use in the aggregate calculation.
- Output Field - The field for the results of the
calculation.
When you do not enter an output field, the Aggregate processor uses a default output field.
Click the Add icon to add additional calculations.
Group By Fields Optional fields to group by. Use to perform calculations on subsets of the data. When you list more than one field, the processor uses the specified field order. When not used, the processor performs calculations across all records in the batch.
- Aggregate function - The aggregation function to use
in the calculation. Use one of the following
functions: