Deduplicate

The Deduplicate processor removes duplicate records from a batch.

By default, the processor evaluates entire records for duplicates, removing a record when all of the field names and values match those of another record. You can configure the processor to assess specific fields instead of the entire record. When evaluating specific fields, the processor ignores values in other fields.

For example, to remove records when a customer accidentally submits the same online order twice, you might configure the processor to evaluate the critical details of the order, such as the customer name, shipping address, payment details, and ordered items, while excluding the order ID or timestamp fields.
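The two evaluation modes can be illustrated with a minimal Python sketch. This is not the processor's implementation: the `deduplicate` function, the sample field names, and the choice to keep the first occurrence of each duplicate are assumptions made for illustration, and field values are assumed to be hashable.

```python
from typing import Any, Dict, Iterable, List, Optional

Record = Dict[str, Any]


def deduplicate(batch: Iterable[Record],
                fields: Optional[List[str]] = None) -> List[Record]:
    """Hypothetical helper: drop duplicates from one batch.

    fields=None compares entire records (all field names and values).
    Otherwise only the listed fields are compared and others are ignored.
    Keeping the first occurrence is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for record in batch:
        if fields is None:
            # All field names and values, independent of field order.
            key = frozenset(record.items())
        else:
            # Only the specified fields take part in the comparison.
            key = tuple(record.get(name) for name in fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Sample data: the first two orders are an accidental double submission.
orders = [
    {"order_id": 1001, "customer": "Ana Ruiz", "address": "12 Elm St", "items": "2x widget"},
    {"order_id": 1002, "customer": "Ana Ruiz", "address": "12 Elm St", "items": "2x widget"},
    {"order_id": 1003, "customer": "Ben Cho", "address": "9 Oak Ave", "items": "1x gadget"},
]

by_fields = deduplicate(orders, fields=["customer", "address", "items"])
whole_record = deduplicate(orders)
```

Here `by_fields` keeps orders 1001 and 1003, because the order ID is excluded from the comparison. `whole_record` keeps all three orders, because the differing order IDs make every record unique.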

The Deduplicate processor is case sensitive, but it does not consider field order when comparing records.
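As a rough illustration of these comparison rules, the following Python sketch builds an order-independent key for each record. The `record_key` helper and the sample records are hypothetical, and the sketch only demonstrates case sensitivity for field values.

```python
def record_key(record):
    """Hypothetical comparison key: field names and values, order ignored."""
    return frozenset(record.items())


a = {"name": "Ana", "city": "Lima"}
b = {"city": "Lima", "name": "Ana"}  # same fields and values, different order
c = {"name": "ana", "city": "Lima"}  # value differs only in case

same_despite_order = record_key(a) == record_key(b)  # True: order is ignored
same_despite_case = record_key(a) == record_key(c)   # False: case matters
```

Under these assumptions, `a` and `b` would be treated as duplicates, while `a` and `c` would both be kept.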

When you configure the Deduplicate processor, you specify whether to evaluate the entire record or specified fields. When evaluating specified fields, you list the fields to use.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.
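Because duplicates are removed within a batch, a duplicate that arrives in a later batch is not detected. The sketch below shows the effect with plain Python lists; concatenating two batches only loosely stands in for what a Window processor does, and the `dedupe` helper and sample records are hypothetical.

```python
def dedupe(batch):
    """Hypothetical helper: whole-record deduplication within one batch."""
    seen, unique = set(), []
    for record in batch:
        key = frozenset(record.items())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


batch_1 = [{"id": 1, "sku": "A"}, {"id": 2, "sku": "B"}]
batch_2 = [{"id": 1, "sku": "A"}, {"id": 3, "sku": "C"}]  # repeats a record from batch_1

# Each batch evaluated on its own: the cross-batch duplicate survives.
separately = dedupe(batch_1) + dedupe(batch_2)

# One larger batch: the duplicate is removed.
combined = dedupe(batch_1 + batch_2)
```

In this sketch `separately` still contains four records, while `combined` contains three.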

Configuring a Deduplicate Processor

Configure a Deduplicate processor to remove duplicate records from a batch.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Cache Data	Caches the data processed for a batch so that multiple downstream stages can reuse it. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

On the Deduplicate tab, configure the following options:


Deduplicate Property	Description
Evaluate	Determines how the processor evaluates records. All Fields - Evaluates all fields. The processor removes a record only when all of its field names and values match those of another record. Specified Fields - Evaluates only the specified fields. The processor removes records that have the same values for every specified field, ignoring values in other fields. In both cases, the processor is case sensitive but does not consider field order.
Fields to Evaluate	One or more fields to evaluate for duplicate values. Used when Evaluate is set to Specified Fields. Click the Add icon to add more fields.