Profile

The Profile processor calculates descriptive statistics for string and numeric data. Use the Profile processor to help you profile and understand data.

The processor calculates count, mean, standard deviation, minimum, and maximum statistics across all records in the batch. The processor calculates the statistics for string and numeric fields only, ignoring all other fields in the record.

The processor generates a total of five output records for each batch, one record for each calculated statistic. Each output record includes a summary field that lists the type of statistic calculated for the record. The remaining fields contain the calculated statistic for that field across all records in the batch.

When you configure the Profile processor, you define whether the processor profiles all fields or specific fields in each record.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.

Profile Statistics

The Profile processor calculates the following statistics for each specified string or numeric field across all records in the batch:
Count
Returns the number of non-null values for the field. The processor does not include null values in the count.
Mean
Returns the mean value of the field. When calculating the mean of a string field, the processor returns a null value.
Standard deviation
Returns the standard deviation value of the field. When calculating the standard deviation of a string field, the processor returns a null value.
Minimum
Returns the minimum value of the field. When calculating the minimum value of a string field, the processor alphabetizes the values and returns the first value in the alphabetized list.
Maximum
Returns the maximum value of the field. When calculating the maximum value of a string field, the processor alphabetizes the values and returns the last value in the alphabetized list.
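
The per-field rules above can be sketched in a few lines of Python. This is only an illustration, not the processor's implementation: the function name profile_field is hypothetical, the sample standard deviation is an assumption (it agrees with the example values shown later in this topic), and Python's default string ordering stands in for the alphabetized ordering described above.

```python
import statistics

def profile_field(values):
    """Hypothetical helper: profile one field's values across a batch."""
    # Null values are left out of every statistic, including the count.
    present = [v for v in values if v is not None]
    # Assumption: a field is treated as a string field if it holds any string.
    numeric = not any(isinstance(v, str) for v in present)
    return {
        "count": len(present),
        # Mean and standard deviation are null (None) for string fields.
        "mean": statistics.mean(present) if numeric and present else None,
        # Assumption: sample standard deviation, which needs two or more values.
        "stddev": statistics.stdev(present) if numeric and len(present) > 1 else None,
        # For strings, min and max are the first and last values in sorted order.
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }

days = profile_field([6, 20, None, 15, 25])
streets = profile_field(["123 Main St", "2385 First St", "985 Spruce St", "3480 Grove St"])
print(days)
print(streets)
```

The null value in the first call is skipped, so the count is 4 rather than 5. The second call returns None for the mean and standard deviation because the field contains strings.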

Output Records

The Profile processor generates a total of five output records for each batch, one record for each of the following calculated statistics:
  • Count
  • Mean
  • Standard deviation
  • Minimum
  • Maximum

Each output record includes a summary field that lists the type of statistic calculated for the record. The remaining fields contain the calculated statistic for that field across all records in the batch.

The processor calculates the statistics for string and numeric fields only, dropping all other fields from the input record.

For example, let's say a batch contains the following data:
Address        SalePrice  DaysOnMarket  MultipleOffers
123 Main St    175000     6             true
2385 First St  260000     20            false
985 Spruce St  300000     15            false
3480 Grove St  185000     25            true

You configure the processor to profile all fields. The processor produces the following five output records:

Summary  Address        SalePrice           DaysOnMarket
count    4              4                   4
mean     null           230000.0            16.5
stddev   null           60138.728508895714  8.103497187428813
min      123 Main St    175000              6
max      985 Spruce St  300000              25

Notice how the processor drops the boolean field MultipleOffers from the output records, even though the processor is configured to profile all fields.
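
As a rough check on the example, the following Python sketch rebuilds the five output records from the same batch. It is not the processor's code: the names is_profiled, summarize, and profile_batch are hypothetical, field types are inferred from the first record for brevity, and the sample standard deviation is assumed because it matches the stddev values shown above.

```python
import statistics

batch = [
    {"Address": "123 Main St", "SalePrice": 175000, "DaysOnMarket": 6, "MultipleOffers": True},
    {"Address": "2385 First St", "SalePrice": 260000, "DaysOnMarket": 20, "MultipleOffers": False},
    {"Address": "985 Spruce St", "SalePrice": 300000, "DaysOnMarket": 15, "MultipleOffers": False},
    {"Address": "3480 Grove St", "SalePrice": 185000, "DaysOnMarket": 25, "MultipleOffers": True},
]

def is_profiled(value):
    # Only string and numeric values are profiled. bool is a subclass of int
    # in Python, so it has to be excluded explicitly.
    return isinstance(value, (str, int, float)) and not isinstance(value, bool)

def summarize(values, stat):
    if stat == "count":
        return len(values)
    if stat == "min":
        return min(values)
    if stat == "max":
        return max(values)
    if isinstance(values[0], str):
        return None  # mean and stddev are null for string fields
    return statistics.mean(values) if stat == "mean" else statistics.stdev(values)

def profile_batch(records):
    # Assumption: the first record determines which fields are string or numeric.
    fields = [name for name, value in records[0].items() if is_profiled(value)]
    columns = {name: [r[name] for r in records if r[name] is not None] for name in fields}
    return [
        {"summary": stat, **{name: summarize(col, stat) for name, col in columns.items()}}
        for stat in ("count", "mean", "stddev", "min", "max")
    ]

output = profile_batch(batch)
for record in output:
    print(record)
```

Each printed record corresponds to one row of the output shown above, and MultipleOffers does not appear in any of them.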

Configuring a Profile Processor

Configure a Profile processor to calculate descriptive statistics for string and numeric data within a batch.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property  Description
    Name              Stage name.
    Description       Optional description.
  2. On the Profile tab, configure the following properties:
    Profile Property  Description
    Profile Mode      Mode used to profile data:
                      • All Fields - Profile all string and numeric fields in the records.
                      • Specific Fields - Profile specific string and numeric fields in the records.
    Specific Fields   Fields to profile.

                      Click the Add icon to specify another field to profile.