Record Deduplicator

Supported pipeline types:
  • Data Collector

The Record Deduplicator evaluates records for duplicates and routes them to two output streams: one for unique records and one for duplicate records. Use the Record Deduplicator to discard duplicate data or to route duplicate data through different processing logic.
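
Conceptually, the processor behaves like a splitter with two outputs. The following Python sketch illustrates that idea only; it is not the Data Collector implementation, and it assumes flat records whose field values are hashable.

  from typing import Any, Dict, Iterable, List, Tuple

  Record = Dict[str, Any]

  def split_records(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
      """Compare entire records and route each one to a unique or duplicate stream."""
      seen = set()
      unique: List[Record] = []
      duplicate: List[Record] = []
      for record in records:
          # A canonical, hashable form of the whole record serves as the comparison key.
          key = tuple(sorted(record.items()))
          if key in seen:
              duplicate.append(record)   # already seen: send to the duplicate stream
          else:
              seen.add(key)
              unique.append(record)      # first occurrence: send to the unique stream
      return unique, duplicate

Passing the same record through twice would place the first copy in the unique stream and the second copy in the duplicate stream.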

The Record Deduplicator can compare entire records or a subset of fields. Use a subset of fields to focus the comparison on fields of concern. For example, to discard purchases that are accidentally submitted more than once, you might compare information about the purchaser, selected items, and shipping address, but ignore the timestamp of the event.

To enhance pipeline performance, the Record Deduplicator hashes the comparison fields and uses the hashed values to detect duplicates. On rare occasions, a hash function can generate a collision that causes distinct records to be incorrectly treated as duplicates.
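
The sketch below, continuing the purchase example, shows roughly how hashing a subset of fields can work; the field names and the choice of SHA-256 are assumptions made for illustration, not the processor's actual internals.

  import hashlib
  import json
  from typing import Any, Dict

  # Hypothetical comparison fields; the event timestamp is deliberately left out.
  COMPARISON_FIELDS = ("purchaser", "items", "shipping_address")

  seen_hashes: set = set()

  def comparison_hash(record: Dict[str, Any]) -> bytes:
      """Hash only the selected fields so repeated submissions match despite new timestamps."""
      subset = {field: record.get(field) for field in COMPARISON_FIELDS}
      # Canonical JSON keeps the digest stable regardless of field ordering.
      canonical = json.dumps(subset, sort_keys=True, default=str)
      return hashlib.sha256(canonical.encode("utf-8")).digest()

  def is_duplicate(record: Dict[str, Any]) -> bool:
      """Compare fixed-size digests instead of full field values.
      If two distinct field subsets ever hashed to the same digest (a collision),
      a unique record would be misclassified as a duplicate."""
      digest = comparison_hash(record)
      if digest in seen_hashes:
          return True
      seen_hashes.add(digest)
      return False

Comparing and storing fixed-size digests rather than the comparison values themselves is what keeps the lookup cost and memory footprint low.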