Aggregate

The Aggregate processor performs aggregate calculations on data within a batch. The processor does not perform calculations across multiple batches. To perform calculations on subsets of a batch, you can specify fields to group by.

For example, you can use the Aggregate processor to calculate the sum of all purchases in a batch grouped by state, and to write the results to a State_Total output field in each record.

Records generated by the Aggregate processor include the output fields and the fields to group by. All other fields from the input record are dropped.

When you configure the Aggregate processor to perform a calculation, you specify the aggregation field, aggregate function, and output field to use. You can specify multiple sets of calculations to perform. You can optionally specify fields to group by for the calculations.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.

Aggregate Processing

The Aggregate processor can perform one or more aggregate calculations on data in a batch.

For each calculation that you want to perform, you specify the aggregate function, the field to use for the calculation, and an output field for the results.

By default, the processor performs calculations using all of the data in the batch. To perform calculations on subsets of data, specify one or more fields to group by.

After performing the calculations, the processor passes the output fields and the fields to group by to the generated records. The processor drops all other fields from the input records.

Example

Say you want to know the number of transactions and the total for those transactions for each state. To do this, you configure the processor as follows:
  • Use the COUNT function and output the results to a StateCount field.
  • Use the SUM function with the Total field, and output the results to a StateTotal field.
  • Set the processor to group by the State field.

Let's say a batch contains the following data:

TransactionID StoreID State Total
0032355 35 MD 230.40
0016433 20 IL 90.50
0016434 20 IL 489.23
0032356 35 MD 63.50
0032357 35 MD 150.49
The Aggregate processor produces two output records, one for the state of Maryland and the other for the state of Illinois. The processor writes the results of the aggregate calculations to the StateCount and StateTotal fields in each record as follows:
State StateCount StateTotal
MD 3 444.39
IL 2 579.73

Aggregate Functions

You can use the following aggregate functions with the Aggregate processor:
  • Approximate Count Version - Returns the approximate number of distinct items in a group.
  • Approximate Variance - Returns the variance of a values from a group. Approximate variance uses a sample of a group rather than the full population.
  • Average - Returns the average of the values in a group.
  • Count - Returns the number of items in a group.
  • First - Returns the value of the first record in a group.
  • Last - Returns the value of the last record in a group.
  • Max - Returns the maximum value of a group.
  • Median - Returns the median of the values of a group.
  • Min - Returns the minimum value of a group.
  • Sum - Returns the sum of the values of a group.
  • Standard Deviation - Returns the standard deviation of a value from the group. Standard deviation uses a sample of a group rather than the full population.

Default Output Fields

When you configure the Aggregate processor, you can specify the output field to use. When you do not specify an output field, the Aggregate processor uses a default output field.

The default output field is named based on the aggregation function that you use. The following table lists the names of the output fields that are used when you do not specify an output field:
Aggregate Function Default Output Field Name
Approximate Count Version approx_count_distinct
Approximate Variance variance
Average avg
Count count
First first
Last last
Max max
Median median
Min min
Sum sum
Standard Deviation std_dev

Configuring an Aggregate Processor

Configure an Aggregate processor to perform aggregate calculations across a batch of data.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Aggregate tab, configure the following properties:
    Aggregate Property Description
    Aggregate Type and Field The aggregate calculations to perform. Configure the following properties:
    • Aggregate function - The aggregation function to use in the calculation. Use one of the following functions:
      • Approximate Count Version - Returns the approximate number of distinct items in a group.
      • Approximate Variance - Returns the variance of a values from a group. Approximate variance uses a sample of a group rather than the full population.
      • Average - Returns the average of the values in a group.
      • Count - Returns the number of items in a group.
      • First - Returns the value of the first record in a group.
      • Last - Returns the value of the last record in a group.
      • Max - Returns the maximum value of a group.
      • Median - Returns the median of the values of a group.
      • Min - Returns the minimum value of a group.
      • Sum - Returns the sum of the values of a group.
      • Standard Deviation - Returns the standard deviation of a value from the group. Standard deviation uses a sample of a group rather than the full population.
    • Aggregate Field - The field to use in the aggregate calculation.
    • Output Field - The field for the results of the calculation.

      When you do not enter an output field, the Aggregate processor uses a default output field.

    Click the Add icon to add additional calculations.

    Group By Fields Optional fields to group by. Use to perform calculations on subsets of the data.

    When you list more than one field, the processor uses the specified field order. When not used, the processor performs calculations across all records in the batch.