Processing Changed Data
Certain stages enable you to easily process data changes, such as change data capture (CDC) data or transactional data, in a pipeline.
CDC-enabled origins can read change capture data. Some read change capture data exclusively; others can be configured to read it. When reading changed data, these origins determine the CRUD operation associated with the data, such as insert, update, upsert, or delete, and record it in the sdc.operation.type record header attribute.
CRUD-enabled processors and destinations can use the CRUD operation type in the sdc.operation.type header attribute when writing records, enabling the external system to perform the appropriate operation.
Using a CDC-enabled origin and CRUD-enabled stages in a pipeline allows you to easily write changed data from one system into another. You can also use a CDC-enabled origin to write to non-CRUD destinations, and non-CDC origins to write to CRUD-enabled stages. For information on how that works, see Use Cases.
CRUD Operation Header Attribute
CDC-enabled origins include the sdc.operation.type record header attribute in all records when reading changed data.
CRUD-enabled processors and destinations can use the CRUD operation type in the sdc.operation.type header attribute when writing records, enabling the external system to perform the appropriate operation.
The sdc.operation.type record header attribute uses the following integers to represent CRUD operations:
- 1 for INSERT records
- 2 for DELETE records
- 3 for UPDATE records
- 4 for UPSERT records
- 5 for unsupported operations or codes
- 6 for UNDELETE records
- 7 for REPLACE records
- 8 for MERGE records
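For reference, the code-to-operation mapping can be expressed programmatically. The following is a minimal Python sketch, not Data Collector code; the dictionary-based record shape is assumed purely for illustration:

```python
# Mapping of sdc.operation.type integer codes to CRUD operations, as listed above.
SDC_OPERATION_TYPES = {
    1: "INSERT",
    2: "DELETE",
    3: "UPDATE",
    4: "UPSERT",
    5: "UNSUPPORTED",
    6: "UNDELETE",
    7: "REPLACE",
    8: "MERGE",
}


def operation_name(header_attributes):
    """Return the CRUD operation name for a record's header attributes.

    Header attribute values arrive as strings, so the code is parsed as an
    integer; missing or unknown codes are treated as unsupported (5).
    """
    try:
        code = int(header_attributes.get("sdc.operation.type", 5))
    except ValueError:
        code = 5
    return SDC_OPERATION_TYPES.get(code, "UNSUPPORTED")


# A record flagged as an update by a CDC-enabled origin:
print(operation_name({"sdc.operation.type": "3"}))  # UPDATE
```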
Earlier Implementations
Some origins were enabled for CDC using different record header attributes in earlier releases, but they all now include the sdc.operation.type record header attribute. All earlier CRUD header attributes are retained for backward compatibility.
Similarly, CRUD-enabled destinations that were enabled to look for the CRUD operation type in other header attributes now look for the sdc.operation.type record header attribute first and check the alternate attribute afterwards. The alternate header attribute functionality is retained for backward compatibility.
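That lookup order can be sketched as follows. This is an illustrative Python snippet, not stage code, and the legacy attribute name used in the example is hypothetical:

```python
def resolve_operation_code(header_attributes, legacy_attribute):
    """Look up the CRUD operation code, preferring sdc.operation.type.

    Falls back to the stage-specific header attribute that earlier releases
    used, which is retained for backward compatibility.
    """
    code = header_attributes.get("sdc.operation.type")
    if code is not None:
        return code
    return header_attributes.get(legacy_attribute)


# A record written with only a legacy attribute (hypothetical name)
# still resolves to an operation code.
print(resolve_operation_code({"legacy.operation.code": "1"}, "legacy.operation.code"))  # 1
```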
CDC-Enabled Stages
CDC-enabled stages provide the CRUD operation type in the sdc.operation.type record header attribute. Some origins provide alternate and additional header attributes.
CDC-Enabled Stage | CRUD Record Header Attributes |
---|---|
Aurora PostgreSQL CDC Client | Either includes the CRUD operation type in the record or includes the operation type in two record header attributes. Includes additional CDC information in record header attributes. For more information, see Record Contents and Generated Records. |
CONNX CDC | Includes the CRUD operation type in the sdc.operation.type record header attribute. For more information, see Record Attributes. |
MapR DB CDC | Includes the CRUD operation type in the sdc.operation.type record header attribute. Includes additional CDC information in record header attributes. For more information, see CRUD Operation and CDC Header Attributes. |
MongoDB Atlas CDC | Includes the CRUD operation type in the sdc.operation.type record header attribute. Can include additional CDC information in record header attributes. For more information, see Generated Records. |
MongoDB Oplog | Includes the CRUD operation type in the sdc.operation.type record header attribute. Can include additional CDC information in record header attributes. For more information, see Generated Records. |
MySQL Binary Log | Includes the CRUD operation type in the sdc.operation.type record header attribute. Includes additional CDC information in record fields. For more information, see Generated Records. |
Oracle CDC | Includes the CRUD operation type in the sdc.operation.type record header attribute and in a second header attribute. For more information, see CRUD Operation Header Attributes. Includes additional CDC information in record header attributes. |
Oracle CDC Client | Includes the CRUD operation type in the sdc.operation.type record header attribute and in a second header attribute. For more information, see CRUD Operation Header Attributes. Includes additional CDC information in record header attributes with the oracle.cdc prefix. |
Oracle Multitable Consumer | Includes the CRUD operation type in the sdc.operation.type record header attribute and in a second header attribute. For more information, see CRUD Operation Header Attributes. |
PostgreSQL CDC Client | Either includes the CRUD operation type in the record or includes the operation type in two record header attributes. Includes additional CDC information in record header attributes. For more information, see Record Contents and Generated Records. |
Salesforce | Includes the CRUD operation type in the sdc.operation.type record header attribute. For more information, see CRUD Operation Header Attribute. |
SQL Parser | Includes the CRUD operation type in the sdc.operation.type record header attribute and in a second header attribute. For more information, see Generated Records. |
SQL Server CDC Client | Includes the CRUD operation type in the sdc.operation.type record header attribute. Includes CDC information in additional record header attributes. |
SQL Server Change Tracking | Includes the CRUD operation type in the sdc.operation.type record header attribute. Includes additional information from change tables. |
Data Changes in SQL Server or Azure SQL Database
Method Used to Track Changes | Data Collector Origin |
---|---|
CDC tables | SQL Server CDC Client. For more information about CDC tables, see the Microsoft documentation. |
Change tracking tables | SQL Server Change Tracking. For more information about change tracking tables, see the Microsoft documentation. |
Temporal tables | JDBC Multitable Consumer or JDBC Query Consumer. For more information about temporal tables, see the Microsoft documentation. |
CRUD-Enabled Stages
CRUD-Enabled Stage | Supported Operations | Stage Processing |
---|---|---|
JDBC Tee processor | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. A Change Log property enables processing records based on the CDC-enabled origin in the pipeline. For more information, see CRUD Operation Processing. |
Azure Synapse SQL destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Couchbase destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Databricks Delta Lake destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Elasticsearch destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Google BigQuery destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
JDBC Producer destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. A Change Log Format property enables processing records based on the CDC-enabled origin in the pipeline. For more information, see CRUD Operation Processing. |
Kudu destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
MapR DB JSON destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see Writing to MapR DB JSON. |
MongoDB destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see Define the CRUD Operation. |
MongoDB Atlas destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see Define the CRUD Operation. |
Oracle destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Redis destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see Define the CRUD Operation. |
Salesforce destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Salesforce Bulk API 2.0 destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
SingleStore destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. A Change Log Format property enables processing records based on the CDC-enabled origin in the pipeline. For more information, see CRUD Operation Processing. |
Snowflake destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
Teradata destination | | Determines the operation to use based on the sdc.operation.type record header attribute and related stage properties. For more information, see CRUD Operation Processing. |
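To make the dispatch described in the table concrete, here is a minimal Python sketch of how a CRUD-enabled destination might translate the operation code on each record into a statement for the external system. It is not code from any of the stages above; the table and column names are hypothetical:

```python
def to_sql(record, header_attributes, table="customers"):
    """Build a SQL statement that matches the record's CRUD operation code.

    A CRUD-enabled destination performs this kind of dispatch internally,
    using the sdc.operation.type header attribute set upstream.
    """
    code = header_attributes.get("sdc.operation.type")
    key = record["id"]
    name = record.get("name")
    if code == "1":  # INSERT
        return f"INSERT INTO {table} (id, name) VALUES ({key}, '{name}')"
    if code == "2":  # DELETE
        return f"DELETE FROM {table} WHERE id = {key}"
    if code == "3":  # UPDATE
        return f"UPDATE {table} SET name = '{name}' WHERE id = {key}"
    raise ValueError(f"Unsupported or unhandled operation code: {code}")


print(to_sql({"id": 7, "name": "Ada"}, {"sdc.operation.type": "3"}))
# UPDATE customers SET name = 'Ada' WHERE id = 7
```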
Processing the Record
Change logs can provide record data in different formats. The JDBC Tee processor and JDBC Producer destination can decode most change log formats to generate record data based on the origin change log. When using other CRUD-enabled destinations, you might need to add additional processing to the pipeline to alter the format of the record.
For example, Microsoft SQL CDC records created by the JDBC Query Consumer origin contain CDC fields in the record, in addition to record data. You might use a Field Remover processor to drop any unnecessary fields from the record.
In contrast, the MySQL binary logs read by the MySQL Binary Log origin provide new or updated data in a New Data map field and changed or deleted data in a Changed Data map field. You might want to use the Field Flattener processor to flatten the map field with the data that you need, and a Field Remover processor to remove any unnecessary fields.
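As a rough illustration of that reshaping outside of Data Collector, the following Python sketch flattens a nested map field and drops fields a destination does not need. The field names are hypothetical and do not match any specific origin's output:

```python
def reshape_record(record):
    """Flatten a nested change-data map and drop bookkeeping fields.

    Mirrors the effect of a Field Flattener followed by a Field Remover.
    """
    flattened = dict(record)
    # Flatten the nested map field into dotted top-level fields,
    # similar to what Field Flattener does.
    nested = flattened.pop("Data", {}) or {}
    for name, value in nested.items():
        flattened["Data." + name] = value
    # Drop fields the destination does not need, similar to Field Remover.
    for unwanted in ("Type", "Offset"):
        flattened.pop(unwanted, None)
    return flattened


source = {"Data": {"id": 42, "name": "Ada"}, "Type": "UPDATE", "Offset": 1407}
print(reshape_record(source))
# {'Data.id': 42, 'Data.name': 'Ada'}
```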
For details on the format of generated records, see the documentation for the CDC-enabled origin.
Use Cases
- CDC-enabled origin with CRUD-enabled destinations
- You can use a CDC-enabled origin and a CRUD-enabled destination to easily process changed records and write them to a destination system.
- CDC-enabled origin to non-CRUD destinations
- If you need to write changed data to a destination system without a CRUD-enabled destination, you can use an Expression Evaluator processor or any scripting processor to move the CRUD operation information from the sdc.operation.type header attribute to a field, so the information is retained in the record. For example, say you want to read from Oracle LogMiner redo logs and write the records to Hive tables with all of the CDC information in record fields. To do this, you'd use the Oracle CDC Client origin to read the redo logs, then add an Expression Evaluator to pull the CRUD information from the sdc.operation.type header attribute into the record. Oracle CDC Client writes additional CDC information, such as the table name and SCN, into oracle.cdc header attributes, so you can use expressions to pull that information into the record as well. Then you can use the Hadoop FS destination to write the enhanced records to Hive (see the sketch after this list).
- Non-CDC origin to CRUD destinations
- When reading data from a non-CDC origin, you can use an Expression Evaluator processor or any scripting processor to define the sdc.operation.type header attribute, also shown in the sketch after this list.
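Both use cases boil down to copying information between the sdc.operation.type header attribute and the record body. The following Python sketch illustrates the idea with plain dictionaries; it is not Expression Evaluator or scripting-processor code, and the field names and the rule for deriving the operation are hypothetical:

```python
def copy_operation_into_field(record_value, header_attributes):
    """CDC-enabled origin to non-CRUD destination: keep the operation in a field."""
    enriched = dict(record_value)
    enriched["operation"] = header_attributes.get("sdc.operation.type")
    return enriched


def set_operation_attribute(header_attributes, record_value):
    """Non-CDC origin to CRUD-enabled destination: define sdc.operation.type.

    Hypothetical rule: records flagged as deleted become DELETE (2),
    everything else becomes UPSERT (4).
    """
    updated = dict(header_attributes)
    updated["sdc.operation.type"] = "2" if record_value.get("deleted") else "4"
    return updated


print(copy_operation_into_field({"id": 7}, {"sdc.operation.type": "1"}))
# {'id': 7, 'operation': '1'}
print(set_operation_attribute({}, {"id": 7, "deleted": True}))
# {'sdc.operation.type': '2'}
```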