Deduplicate

The Deduplicate processor removes duplicate records from a batch.

By default, the processor evaluates entire records for duplicates, removing a record when all of the field names and values match those of another record. You can configure the processor to assess specific fields instead of the entire record. When evaluating specific fields, the processor ignores values in other fields.
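The two evaluation modes can be illustrated outside the product with a minimal Python sketch. This is not the processor's implementation: the deduplicate function, its fields parameter, and the sample field names are hypothetical. The sketch also assumes that the first occurrence of a duplicate is the one kept and that field values are hashable, neither of which is stated above.

```python
from typing import Any, Optional, Sequence

Record = dict[str, Any]


def deduplicate(batch: Sequence[Record],
                fields: Optional[Sequence[str]] = None) -> list[Record]:
    """Return the batch without duplicates (hypothetical helper).

    fields=None compares every field name and value in the record.
    Otherwise only the listed fields are compared; all others are ignored.
    """
    seen = set()
    unique = []
    for record in batch:
        if fields is None:
            key = frozenset(record.items())
        else:
            # Assumption: a listed field that is missing compares as None.
            key = frozenset((name, record.get(name)) for name in fields)
        if key not in seen:
            # Assumption: keep the first occurrence, drop later ones.
            seen.add(key)
            unique.append(record)
    return unique


batch = [
    {"id": "a1", "email": "ana@example.com"},
    {"id": "a1", "email": "ana@example.com"},
    {"id": "b2", "email": "ana@example.com"},
]
print(len(deduplicate(batch)))             # 2
print(len(deduplicate(batch, ["email"])))  # 1
```

Comparing entire records removes only the exact copy. Comparing the email field alone also removes the third record, because its differing id value is ignored.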

For example, to remove records when a customer accidentally submits the same online order twice, you might configure the processor to evaluate the critical details of the order, such as the customer name, shipping address, payment details, and ordered items, while excluding the order ID or timestamp fields.
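The order scenario might look like the following Python sketch. The field names (order_id, ts, customer, address, payment, items), the sample data, and the drop_repeat_orders function are all invented for illustration, and keeping the earliest order is an assumption rather than documented behavior.

```python
# Hypothetical names for the fields that identify "the same order".
# order_id and ts are deliberately left out of the comparison.
ORDER_FIELDS = ("customer", "address", "payment", "items")


def drop_repeat_orders(batch):
    """Keep one order per distinct combination of ORDER_FIELDS values."""
    first_seen = {}
    for order in batch:
        key = tuple(order[name] for name in ORDER_FIELDS)
        # setdefault stores the order only if the key is new,
        # so the earliest submission is the one retained (an assumption).
        first_seen.setdefault(key, order)
    return list(first_seen.values())


orders = [
    {"order_id": 1001, "ts": "2024-05-01T10:00:00", "customer": "Ana Ruiz",
     "address": "1 Main St", "payment": "visa-4242", "items": ("mug", "tea")},
    {"order_id": 1002, "ts": "2024-05-01T10:00:07", "customer": "Ana Ruiz",
     "address": "1 Main St", "payment": "visa-4242", "items": ("mug", "tea")},
    {"order_id": 1003, "ts": "2024-05-01T10:02:30", "customer": "Ben Ito",
     "address": "9 Oak Ave", "payment": "mc-1111", "items": ("mug",)},
]
print([o["order_id"] for o in drop_repeat_orders(orders)])  # [1001, 1003]
```

Orders 1001 and 1002 differ only in the excluded order ID and timestamp, so the second submission is treated as a duplicate.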

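The Deduplicate processor compares values case-sensitively, so strings that differ only in case, such as "Smith" and "smith", are not treated as matches. Field order does not matter: records with the same field names and values in a different order are duplicates.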
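A short Python sketch of these comparison rules follows. The record_key function and the sample records are hypothetical, and the sketch shows case sensitivity for field values only; how the processor treats case in field names is not covered here.

```python
def record_key(record):
    """Build an order-independent comparison key (hypothetical helper)."""
    # A frozenset ignores the order in which the fields were added,
    # while string comparison inside it remains case sensitive.
    return frozenset(record.items())


a = {"name": "Ana", "city": "Lima"}
b = {"city": "Lima", "name": "Ana"}   # same data, different field order
c = {"name": "ana", "city": "Lima"}   # value differs only in case

print(record_key(a) == record_key(b))  # True
print(record_key(a) == record_key(c))  # False
```

Under these rules, records a and b are duplicates of each other, while c is kept as a separate record.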

When you configure the Deduplicate processor, you specify whether to evaluate the entire record or specified fields. When evaluating specified fields, you list the fields to use.

Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.