The SQL Server CDC Client origin processes data in Microsoft SQL Server change data capture (CDC) tables. The origin uses multiple threads to enable parallel processing of data.
Use the SQL Server CDC Client origin to generate records from CDC tables. To read data from Microsoft SQL Server change tracking tables, use the SQL Server Change Tracking origin. For more information about the differences between CDC and change tracking data, see the Microsoft SQL Server documentation.
The SQL Server CDC Client origin includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
You might use this origin to perform database replication. You can use a separate pipeline with the JDBC Query Consumer or JDBC Multitable Consumer origin to read existing data. Then start a pipeline with the SQL Server CDC Client origin to process subsequent changes.
When you configure the origin, you specify the SQL Server capture instance names - the origin processes the related CDC tables. You can define groups of tables in the same database and any initial offsets to use. When you omit initial offsets, the origin processes all available data in the CDC tables.
You can enable late table processing to allow the origin to process tables that appear after the pipeline starts. You can also configure the origin to check for schema changes in processed tables and to generate an event after discovering a change.
To determine how the origin connects to the database, you specify connection information, a query interval, number of retries, and any custom JDBC configuration properties that you need.
To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Before you use the SQL Server CDC Client origin, install the JDBC driver for the database. You cannot access the database until you install the required driver.
For information about installing additional drivers, see Install External Libraries.
The SQL Server CDC Client origin supports the SQL Server insert and delete operations. Updates captured after the update operation are treated as update, and updates captured before the update operation are treated as an unsupported operation.
The SQL Server CDC Client origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the SQL Server CDC Client origin retrieves the list of CDC tables associated with the source tables defined in the table configurations. The origin then uses multiple concurrent threads based on the Number of Threads property. Each thread reads data from a single table.
As the pipeline runs, each thread connects to the origin system and creates a batch of data, and passes the batch to an available pipeline runner. A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors and destinations in the pipeline and performs all pipeline processing after the origin.
Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline instances, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Say you are reading from 10 tables. You set the Number of Threads property to 5 and the Maximum Pool Size property to 6. When you start the pipeline, the origin retrieves the list of tables. The origin then creates five threads to read from the first five tables, and by default Data Collector creates a matching number of pipeline runners. Upon receiving data, a thread passes a batch to each of the pipeline runners for processing.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
Each origin thread creates a batch of data from a single table. You can define one of the following strategies that the threads use to create each batch:
When you configure SQL Server CDC Client, you can define multiple CDC tables using a single set of table configuration properties. You can also define multiple table configurations to process multiple groups of CDC tables.
<capture instance name pattern>
Use this format when CDC tables are created based on capture instance names. You can use the pattern to process a full set of CDC tables or to exclude some CDC tables from processing.
For example, say you have a Sales.Accounts table with a CDC table named Sales_Accounts_CT. After adding several columns to the table, you create a new CDC table called Sales_Accounts2_CT.
To process both CDC tables, you can specify the following capture instance name pattern: Sales_Accounts%. To process only the CDC data that occurred after the schema change, you can specify the following capture instance name: Sales_Accounts2.
<schema name>_<data table name pattern>
Use this format when CDC tables are created based on data tables instead of capture instance names.
For example, to process all available CDC tables for data tables in a Sales schema, you might use Sales_%. Or, to process the CDC tables associated with a set of data tables with the Transact prefix, you might use Sales_Transact%.
<schema name>_%
For example, to process all tables in the sales schema, enter sales_%.
For example, say you want to process all CDC tables in the schema except for those that start with "dept". You can use the default % for the table name pattern, and enter dept* for the table exclusion pattern.
For more information about using regular expressions with Data Collector, see Regular Expressions Overview.
You can define the initial order that the origin uses to read the tables.
The origin uses the table order strategy only for the initial reading of the tables. When threads switch back to previously read tables, they read from the next available table, regardless of the defined order.
You can configure the SQL Server CDC Client to process data in CDC tables that appear after the pipeline starts.
When you allow late table processing, the SQL Server CDC Client origin uses a background thread to check for late CDC tables. The origin checks at regular user-defined intervals.
You can configure the SQL Server CDC Client origin to check for schema changes in the tables being processed. When checking for schema changes, the origin includes a schema check statement in the SQL query.
The origin generates a schema-change event each time that it finds a schema change: one for each table with a schema change.
When the SQL Server CDC Client origin generates a record, it places the CDC information, such as the CDC operation and start LSN values, in record header attributes. And it places row data provided by the CDC tables into the fields of the generated records.
The SQL Server CDC Client origin generates JDBC record header attributes that provide the SQL Server CDC data for each record, such as the start or end log sequence numbers (LSN).
The origin also includes the sdc.operation.type attribute and information from the SQL Server CDC tables. The SQL Server CDC Client header attributes are prefixed with "jdbc". The names of the SQL Server CDC column names are included in the header attribute name, as follows: jdbc.<CDC column name>.
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
| Header Attribute Name | Description |
|---|---|
| sdc.operation.type |
The origin uses the following values to represent the operation
type:
|
| jdbc.tables | Provides a comma-separated list of source
tables for the fields in the record. Note: Not all JDBC drivers
provide this information.
Oracle uses all caps for schema, table, and column names by default. Names can be lower- or mixed-case only if the schema, table, or column was created with quotation marks around the name. |
| jdbc.<column name>.jdbcType | Provides the original SQL data type for each field in the record.
Because the record read from the SQL Server CDC table includes CDC columns, the generated record also includes corresponding jdbc.<column name>.jdbcType header attributes for those columns. For example, since the original data includes a __$start_lsn column, the resulting record has a jdbc.__$start_lsn.jdbc.Type header attribute. It also has a jdbc.__$start_lsn attribute that is generated separately by the origin, as described below. |
| jdbc.<column name>.jdbc.precision | Provides the original precision for all numeric and decimal fields. |
| jdbc.<column name>.jdbc.scale | Provides the original scale for all numeric and decimal fields. |
| jdbc. __$command_id | Data from the SQL Server CDC __$command_id column. |
| jdbc.__$end_lsn | Data from the SQL Server CDC __$end_lsn column. |
| jdbc.__$operation | The CRUD operation type using SQL Server codes, as defined in the SQL Server CDC __$operation column. |
| jdbc.__$seqval | Data from the SQL Server CDC __$seqval column. |
| jdbc.__$start_lsn | Data from the SQL Server CDC __$start_lsn column. |
| jdbc.__$update_mask | Data from the SQL Server CDC __$update_mask column. |
For details about the CDC attributes, see the SQL Server documentation.
If you use a CRUD-enabled destination in the pipeline such as JDBC Producer or Elasticsearch, the destination can use the operation type when writing to destination systems. When necessary, you can use an Expression Evaluator or scripting processors to manipulate the value in the sdc.operation.type header attribute. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
The SQL Server CDC Client origin can generate events that you can use in an event stream. When you enable event generation, the origin generates an event when it completes processing the data returned by the specified queries for all tables.
If you enable schema change event generation, the origin also generates an event each time it finds a schema change.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Case Study: Stop the Pipeline.
For an example, see Case Study: Event Storage.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
| Record Header Attribute | Description |
|---|---|
| sdc.event.type | Event type. Uses one of the following types:
|
| sdc.event.version | An integer that indicates the version of the event record type. |
| sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The origin generates a no-more-data event record when the origin completes processing all available data and the number of seconds configured for Batch Wait Time elapses without any new files appearing to be processed.
The no-more-data event record generated by the origin has the sdc.event.type set to no-more-data and does not include any additional fields.
The origin generates a schema-change event record only when you enable the origin to check for schema changes, and the origin discovers a schema change.
| Event Record Field | Description |
|---|---|
| capture-instance-name | The name of the capture instance or CDC table associated with the table with the schema change. |
| source-table-schema-name | The name of the schema that contains the data table. |
| source-table-name | The name of the data table that has a schema changes. |