Deduplicate
The Deduplicate processor removes duplicate records from a batch.
By default, the processor evaluates entire records for duplicates, removing a record when all of the field names and values match those of another record. You can configure the processor to assess specific fields instead of the entire record. When evaluating specific fields, the processor ignores values in other fields.
For example, to remove records when a customer accidentally submits the same online order twice, you might configure the processor to evaluate the critical details of the order, such as the customer name, shipping address, payment details, and ordered items, while excluding the order ID or timestamp fields.
The Deduplicate processor is case sensitive but ignores field order: records that contain the same fields and values in a different order are still treated as duplicates.
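The comparison rules above can be illustrated with a minimal Python sketch. It is not the processor's implementation; `record_key` is a hypothetical helper, and records are assumed to be flat dictionaries.

```python
def record_key(record):
    """Build a comparison key from every field name and value.

    Sorting the (name, value) pairs makes the key independent of field
    order. Names and values are compared exactly, so case matters.
    """
    return tuple(sorted(record.items()))


a = {"name": "Ada", "city": "London"}
b = {"city": "London", "name": "Ada"}  # same fields and values, different order
c = {"name": "ada", "city": "London"}  # differs from a only in case

same_despite_order = record_key(a) == record_key(b)
same_despite_case = record_key(a) == record_key(c)
print(same_despite_order, same_despite_case)  # True False
```

In this sketch, `a` and `b` count as duplicates, while `c` is a distinct record because `"ada"` and `"Ada"` differ in case.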
Configuring a Deduplicate Processor
- In the Properties panel, on the General tab, configure the following properties:
  General Property - Description
  - Name - Stage name.
  - Description - Optional description.
  - Cache Data - Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
- On the Deduplicate tab, configure the following options:
  Deduplicate Property - Description
  - Evaluate - Determines how the processor evaluates records:
    - All Fields - Evaluates all fields, removing only records with the exact same fields and values. The processor removes records that have the same values for all fields.
    - Specified Fields - Evaluates only the specified fields for matching values. The processor removes records that have the same values for every specified field.
    The Deduplicate processor is case sensitive, but is not concerned with field order.
  - Fields to Evaluate - One or more fields to evaluate for duplicate values when evaluating specified fields. Click the Add icon to add additional fields.
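
As a rough illustration of how the two Evaluate options differ, the following Python sketch deduplicates a small batch of orders. It is not the processor's implementation. The `deduplicate` function, its parameter names, and the sample field names are hypothetical. Keeping the first occurrence of each duplicate and treating a missing field as `None` are assumptions made for the example.

```python
def deduplicate(batch, evaluate="All Fields", fields_to_evaluate=()):
    """Return the batch without duplicates.

    Assumption: the first occurrence of each duplicate is the one kept.
    """
    seen = set()
    kept = []
    for record in batch:
        if evaluate == "Specified Fields":
            # Compare only the listed fields; all other fields are ignored.
            key = tuple((name, record.get(name)) for name in fields_to_evaluate)
        else:
            # Compare every field name and value, regardless of field order.
            key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


orders = [
    {"order_id": 1001, "customer": "Ada Lovelace", "item": "keyboard"},
    {"order_id": 1002, "customer": "Ada Lovelace", "item": "keyboard"},
    {"order_id": 1003, "customer": "Alan Turing", "item": "keyboard"},
]

all_fields = deduplicate(orders)
by_order_details = deduplicate(orders, "Specified Fields", ["customer", "item"])
print(len(all_fields), len(by_order_details))  # 3 2
```

With All Fields, the differing `order_id` values keep every record distinct. With Specified Fields set to `customer` and `item`, the second order is treated as a duplicate of the first and is removed.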