Profile
The Profile processor calculates descriptive statistics for string and numeric data. Use the Profile processor to help you profile and understand data.
The processor calculates count, mean, standard deviation, minimum, and maximum statistics across all records in the batch. The processor calculates the statistics for string and numeric fields only, ignoring all other fields in the record.
The processor generates five output records for each batch, one for each calculated statistic. Each output record includes a summary field that lists the type of statistic calculated for the record. The remaining fields contain the calculated statistic for that field across all records in the batch.
When you configure the Profile processor, you define whether the processor profiles all fields or specific fields in each record.
Profile Statistics
- Count: Returns the number of non-null values for the field. The processor does not include null values in the count.
- Mean: Returns the mean value of the field. When calculating the mean of a string field, the processor returns a null value.
- Standard deviation: Returns the standard deviation value of the field. When calculating the standard deviation of a string field, the processor returns a null value.
- Minimum: Returns the minimum value of the field. When calculating the minimum value of a string field, the processor alphabetizes the values and returns the first value in the alphabetized list.
- Maximum: Returns the maximum value of the field. When calculating the maximum value of a string field, the processor alphabetizes the values and returns the last value in the alphabetized list.
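The per-field rules above can be illustrated with a minimal Python sketch. This is not the processor's implementation: the function name `profile_field` is hypothetical, the sample (n - 1) form of standard deviation is an assumption that happens to match the sample values shown later in this topic, and Python's default string ordering stands in for "alphabetized" order.

```python
import statistics
from typing import Sequence, Union

Value = Union[str, int, float, None]


def profile_field(values: Sequence[Value]) -> dict:
    """Return count, mean, stddev, min, and max for one field's values."""
    # Null values are excluded from every statistic, including the count.
    present = [v for v in values if v is not None]
    is_string = any(isinstance(v, str) for v in present)

    result = {"count": len(present), "mean": None, "stddev": None,
              "min": None, "max": None}
    if not present:
        return result

    if not is_string:
        result["mean"] = statistics.fmean(present)
        # Assumption: sample standard deviation, which needs 2+ values.
        if len(present) > 1:
            result["stddev"] = statistics.stdev(present)

    # For strings, sorting approximates alphabetizing (code point order).
    ordered = sorted(present)
    result["min"] = ordered[0]
    result["max"] = ordered[-1]
    return result


print(profile_field([6, 20, None, 25]))
print(profile_field(["Spruce", "Main", None]))
```

Mean and standard deviation stay null for string fields, while minimum and maximum fall back to sort order, mirroring the descriptions above.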
Output Records
The Profile processor generates five output records for each batch, one for each of the following statistics:
- Count
- Mean
- Standard deviation
- Minimum
- Maximum
Each output record includes a summary field that lists the type of statistic calculated for the record. The remaining fields contain the calculated statistic for that field across all records in the batch.
The processor calculates the statistics for string and numeric fields only, dropping all other fields from the input record.
For example, suppose the processor is configured to profile all fields and receives the following batch of records:

Address | SalePrice | DaysOnMarket | MultipleOffers |
---|---|---|---|
123 Main St | 175000 | 6 | true |
2385 First St | 260000 | 20 | false |
985 Spruce St | 300000 | 15 | false |
3480 Grove St | 185000 | 25 | true |

The processor generates the following five output records:

Summary | Address | SalePrice | DaysOnMarket |
---|---|---|---|
count | 4 | 4 | 4 |
mean | null | 230000.0 | 16.5 |
stddev | null | 60138.728508895714 | 8.103497187428813 |
min | 123 Main St | 175000 | 6 |
max | 985 Spruce St | 300000 | 25 |
Notice how the processor drops the boolean field MultipleOffers from the output record, even though the processor is configured to profile all fields.
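The example above can be approximated outside the product with a short Python sketch. It is an illustration only: the helper names (`is_profiled`, `column`, `summarize`, `profile_batch`) are hypothetical, each field is assumed to hold a single data type, and sample standard deviation is assumed because it matches the stddev values in the example.

```python
import statistics

STATS = ("count", "mean", "stddev", "min", "max")


def is_profiled(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (str, int, float)) and not isinstance(value, bool)


def column(records, name):
    """Collect the non-null values of one field across the batch."""
    return [r[name] for r in records if r.get(name) is not None]


def summarize(values):
    numeric = bool(values) and not isinstance(values[0], str)
    return {
        "count": len(values),
        "mean": statistics.fmean(values) if numeric else None,
        "stddev": statistics.stdev(values) if numeric and len(values) > 1 else None,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
    }


def profile_batch(records):
    """Build one summary record per statistic for a batch of records."""
    # Keep only string and numeric fields; all other fields are dropped.
    fields = [n for n in records[0]
              if all(is_profiled(v) for v in column(records, n))]
    stats = {n: summarize(column(records, n)) for n in fields}
    return [{"Summary": s, **{n: stats[n][s] for n in fields}} for s in STATS]


batch = [
    {"Address": "123 Main St", "SalePrice": 175000, "DaysOnMarket": 6, "MultipleOffers": True},
    {"Address": "2385 First St", "SalePrice": 260000, "DaysOnMarket": 20, "MultipleOffers": False},
    {"Address": "985 Spruce St", "SalePrice": 300000, "DaysOnMarket": 15, "MultipleOffers": False},
    {"Address": "3480 Grove St", "SalePrice": 185000, "DaysOnMarket": 25, "MultipleOffers": True},
]

for record in profile_batch(batch):
    print(record)
```

The sketch returns five dictionaries, each carrying a `Summary` key plus one value per profiled field, and omits `MultipleOffers` because its values are boolean.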
Configuring a Profile Processor
Configure a Profile processor to calculate descriptive statistics for string and numeric data within a batch.
- In the Properties panel, on the General tab, configure the following properties:

General Property | Description |
---|---|
Name | Stage name. |
Description | Optional description. |

- On the Profile tab, configure the following properties:

Profile Property | Description |
---|---|
Profile Mode | Mode used to profile data: All Fields (profile all string and numeric fields in the records) or Specific Fields (profile specific string and numeric fields in the records). |
Specific Fields | Fields to profile. Click the Add icon to specify another field to profile. |
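
To illustrate how the two Profile Mode options differ, the following Python sketch selects the fields that would be profiled for a single record. The function name `fields_to_profile` and the mode strings `"all"` and `"specific"` are hypothetical stand-ins for the UI options. Skipping a listed field that is neither string nor numeric is an assumption, since this topic does not state how the processor handles that case.

```python
def is_string_or_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (str, int, float)) and not isinstance(value, bool)


def fields_to_profile(record, mode, specific_fields=()):
    """Return the names of the fields to profile for the given mode."""
    if mode == "all":
        candidates = list(record)
    elif mode == "specific":
        # Only consider the listed fields that exist in the record.
        candidates = [name for name in specific_fields if name in record]
    else:
        raise ValueError(f"unknown profile mode: {mode!r}")
    # Assumption: fields that are not string or numeric are skipped.
    return [name for name in candidates if is_string_or_number(record[name])]


record = {"Address": "123 Main St", "SalePrice": 175000,
          "DaysOnMarket": 6, "MultipleOffers": True}

print(fields_to_profile(record, "all"))
print(fields_to_profile(record, "specific", ["SalePrice", "DaysOnMarket"]))
```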