Oracle CDC
The Oracle CDC origin processes change data capture (CDC) information stored in redo logs and accessed using Oracle LogMiner. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
The Oracle CDC origin provides the same functionality as the original Oracle CDC Client origin in a more efficient package. It also provides functionality not available in the Oracle CDC Client origin. For more information, see Comparing Oracle CDC Origins.
The Oracle CDC origin can process changes for primary and standby databases. It processes data based on the system change number, in ascending order, and buffers changes in memory. The origin captures and adjusts to schema changes.
The origin includes CDC and CRUD information in record header attributes so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
You might use this origin to track the data that has changed in your database in order to run tasks using the changed data. With this origin, you can implement data integration processes based on the identification, capture, and delivery of the changes made to the database.
When you configure the origin, you specify change data capture details, such as the schemas and tables to read from, how to read the initial change, and the operations to include. You specify an initial change or time to start processing, transaction details, and uncommitted transaction handling.
You specify the record format to generate, basic or rich, and you can optionally include a wide range of information in the record or record header, such as the redo statement, null columns, and old and new values for the record. You can define resilience properties, such as how many retries to perform for a task and how long to wait between retries.
You can specify JDBC connection information and user credentials. You can also use a connection to configure the origin. If the schema was created in a pluggable database, you specify the pluggable database name. You can also define custom properties that the driver requires or supports.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Comparing Oracle CDC Origins
Data Collector provides two origins that use LogMiner to process Oracle change data. The Oracle CDC Client origin was the first origin. This new Oracle CDC origin includes key functionality available in the older origin, as well as improvements and additional new features.
If you are new to Data Collector or building a new pipeline that processes Oracle change data, use the newer Oracle CDC origin for a wider range of functionality.
If you already have pipelines that use the older Oracle CDC Client origin, you can continue using that origin. However, you might consider updating the pipeline to use the newer Oracle CDC origin for access to additional functionality.
- Similarities
  - The newer Oracle CDC origin can generate the same records as those generated by the older Oracle CDC Client origin. However, the newer origin has streamlined and reorganized properties, so you configure the origin differently.
  - Both origins require completing similar prerequisite tasks.
  - Both origins place the operation type in the sdc.operation.type record header attribute and can generate additional attributes.
  - You can configure both origins to perform multithreaded parsing.
  - Both origins can generate events that can be used in dataflow trigger pipelines.
- Key differences
  - The newer Oracle CDC origin can process changes from primary and standby databases. The older Oracle CDC Client origin supports only primary databases.
  - The newer Oracle CDC origin receives raw values from LogMiner that it then efficiently converts to the corresponding data values. As a result, the newer origin does not support processing encrypted data. When Transparent Data Encryption is enabled for columns or tables in the Oracle database, that data remains encrypted when processed by the newer origin. The older Oracle CDC Client origin can process encrypted data because it receives the decrypted values from Oracle.
  - The newer origin provides two record formats for processed data.
  - The following options and properties from the older origin are not included in the Oracle CDC origin because the newer origin uses the most efficient option rather than offering a choice between optimal and sub-optimal options:
    - Parser type - The older origin uses the ALL(*) parser by default because it was the only parser available when the origin was created. When the more performant PEG parser became available, it was added to the older origin as an advanced option. The newer origin uses the more efficient PEG parser.
    - Buffer type and location - The older origin provides several buffering options. The newer origin is more efficient than the older origin, so it always buffers data locally, in memory. However, you can configure the maximum buffer size and a buffer wait strategy.
    - Dictionary source - The older origin lets Oracle retrieve dictionary definitions from the online catalog or from redo logs, both of which can be problematic. The newer origin loads dictionary definitions from the Oracle catalog. While the pipeline runs, it tracks dictionary changes from redo logs. To support pipeline stops and starts, the origin stores the dictionary state in the offset, so there is no dependency on dictionaries extracted to redo logs or on LogMiner transforming the original redo log data. This allows for consistent mapping from column values to column data types over time.
    - Schema and table specification - When specifying the schemas and tables to monitor, the older origin supports SQL LIKE syntax for inclusion patterns and Java regular expression syntax for exclusion patterns. The newer origin uses REGEXP_LIKE syntax for all schema and table patterns to provide a richer pattern specification model. This also avoids problems with commonly used characters in schema and table names, such as the underscore ( _ ), which are also SQL LIKE wildcards.
    - Truncate operations - For Truncate operations, the older origin generated only events. The newer Oracle CDC origin also generates events for Truncate operations. However, when configured to use the rich record format, the origin can create records for Truncate operations as well. This enables you to understand when the truncation occurred in relation to processed data.
Updating Oracle CDC Client Pipelines
The newer Oracle CDC origin provides the same functionality as the older Oracle CDC Client origin, with additional enhancements. As a result, you can update existing pipelines to use the new origin if you like. The older origin remains available, so updating existing pipelines is not required.
As you configure the replacement origin, you might reconsider how you want to process Oracle change data. However, you can configure the origin to generate the same records, so you don't have to update the pipeline when you replace the old origin.
Below are some configuration issues to consider when updating existing Oracle CDC Client pipelines to use the newer origin:
- The newer Oracle CDC origin supports processing changes from primary or standby
databases.
The older origin only supports primary databases.
To configure the Oracle CDC origin to use a primary database, on the Oracle tab, set the Database Role property to Primary.
- The newer Oracle CDC origin can generate two different record formats:
basic or rich.
To generate records like those generated by the Oracle CDC Client origin, on the Data tab, set the Record Format property to Basic. Then, keep the default values for the custom record options on the same tab.
- The newer Oracle CDC origin supports REGEXP_LIKE syntax for specifying schemas and tables, while the older origin supports SQL LIKE syntax for inclusion patterns and Java regular expression syntax for exclusion patterns. Update the schema and table patterns from the older origin to use the appropriate syntax. The following tips might help:
  - Inclusion patterns - If you used SQL LIKE syntax to define inclusion patterns, replace all occurrences of the percent character ( % ) with a dot and asterisk ( .* ). Also, replace all underscores ( _ ) that represent a single character with a dot ( . ).
  - Exclusion patterns - If you used Java regular expression syntax, you don't need to make any changes.
  - Java regular expressions - You can use Java regular expressions for both patterns. However, you might anchor them using a caret character ( ^ ) at the start of the expression and a dollar sign ( $ ) at the end of the expression.
- By default, the newer Oracle CDC origin logs details about LogMiner sessions in
the Data Collector log, just like the older origin. However, the newer origin can also generate
events when new LogMiner sessions are created.
If you monitor Data Collector log entries to determine when the origin creates new LogMiner sessions, you might configure the Oracle CDC origin to produce events instead, because Data Collector logs are not a reliable source for automated processes.
To produce events for new LogMiner sessions, on the General tab, select Produce Events. Then, on the Oracle tab, enable the Track LogMiner Sessions property.
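As a quick sanity check for converted inclusion patterns, you can compare the original SQL LIKE predicate with its REGEXP_LIKE equivalent directly against the data dictionary. The schema and pattern below are hypothetical examples:

```sql
-- Hypothetical SQL LIKE inclusion pattern from the older origin: 'ORDERS_%'
-- Converted for the newer origin: replace % with .* and the single-character
-- wildcard _ with . , then anchor the expression: '^ORDERS..*$'
SELECT table_name
FROM   all_tables
WHERE  owner = 'SALES'
AND    table_name LIKE 'ORDERS_%';

SELECT table_name
FROM   all_tables
WHERE  owner = 'SALES'
AND    REGEXP_LIKE(table_name, '^ORDERS..*$');
```

If both queries return the same tables, the converted pattern preserves the original selection.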
Prerequisites
- Configure the database archiving mode.
- Enable supplemental logging for the database or tables.
- Create a user account with the required roles and privileges.
- Install required libraries on Data Collector.
The tasks differ slightly based on the type of Oracle database that you use:
- CDB or multitenant database - A multitenant container database (CDB) that includes one or more pluggable databases (PDB).
- PDB or pluggable database - A database that is part of a CDB.
- Non-CDB database - A standalone database created on earlier versions of Oracle.
Task 1. Configure the Database Archiving Mode
LogMiner provides access to database changes stored in redo log files that summarize database activity. The origin uses this information to generate records. LogMiner requires that the database be open and in ARCHIVELOG mode with archiving enabled.
When reading from a primary database, the database must also be writable. When reading from a standby database, you might archive logs at a higher frequency if you want to approximate real-time processing. For more information, see Primary and Standby Databases.
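If you read from a standby database and want to approximate real-time processing, one approach is to force more frequent log archiving on the primary. The following statements are a sketch based on standard Oracle administration commands; confirm them against your Oracle version's documentation:

```sql
-- Force the current online redo log to be archived immediately:
ALTER SYSTEM ARCHIVE LOG CURRENT;

-- Or cap the time between log switches (here, 60 seconds) so archived
-- logs reach the standby at a predictable cadence:
ALTER SYSTEM SET ARCHIVE_LAG_TARGET = 60 SCOPE = BOTH;
```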
You configure the database archiving mode differently based on whether you use an Oracle Standard, Oracle RAC, or Oracle Exadata database:
- Oracle Standard databases
- Use the following steps to determine the status of the database and to
configure the database archiving mode:
- In a SQL shell, log into the database as a user with DBA privileges.
- Check the database logging mode.
For example:
select log_mode from v$database;
If the command returns ARCHIVELOG, you can go to "Task 2. Enable Supplemental Logging".
If the command returns NOARCHIVELOG, continue with the following steps.
- Shut down the database.
For example:
shutdown immediate;
- Start up and mount the database.
For example:
startup mount;
- Enable archiving, open the database, and make it writable.
For example:
alter database archivelog;
alter database open read write;
- Oracle RAC and Exadata databases
- Use the following steps to configure the database archiving mode and verify
the status of the database:
- In a Terminal session on one of the database nodes, stop and mount
the database.
For example:
srvctl stop database -d <database name>
srvctl start database -d <database name> -o mount
- In a SQL shell, log into the database as a user with DBA privileges
and configure the database.
For example:
alter database archivelog;
- In the Terminal session, restart the database.
For example:
srvctl stop database -d <database name>
srvctl start database -d <database name>
- In the SQL shell, check if the database is in logging mode.
For example:
select log_mode from v$database;
Task 2. Enable Supplemental Logging
To retrieve data from redo logs, LogMiner requires supplemental logging at a database level, table level, or a combination of both.
- Minimal supplemental logging
- You must enable minimal supplemental logging to use the Oracle CDC origin. Enable it in your CDB for a multitenant database or in your non-CDB database.
- Identification key logging
- Enable identification key logging based on your needs:
- Identification key logging is optional depending on the data that you want to process and how you plan to use the data.
- Primary key identification key logging is required to enable proper primary key handling and change tracking. Depending on the structure of each table, primary key logging provides primary key columns, first unique index columns, or all columns to the origin for record generation.
- Full supplemental identification key logging is recommended to
provide data from all columns, those with unchanged data as well as
the primary key and changed columns.
Enable full supplemental logging to ensure that the origin can generate records that include all fields, regardless of whether they have been changed.
For details on the data included in records based on the supplemental logging type, see Parse SQL Statements, Supported Operations, and Record Attributes.
- Determine the logging that you want to enable.
- Before enabling supplemental logging, check if it is enabled for the database.
For example, you might run the following command on your CDB, then the pluggable database:
select supplemental_log_data_min, supplemental_log_data_pk, supplemental_log_data_all from v$database;
If the command returns Yes or Implicit for all three columns, then minimal supplemental logging is enabled, as well as identification key and full supplemental logging. You can skip to "Task 3. Create a User Account."
If the command returns Yes or Implicit for the first two columns, then minimal supplemental logging and identification key logging are both enabled. If this is what you want, you can skip to "Task 3. Create a User Account."
- Enable minimal supplemental logging. For example, you can run the following command from your CDB or from your non-CDB database:
alter database add supplemental log data;
- Optionally, enable identification key logging, if needed.
When enabling identification key logging for a database, do so for the database that contains the schemas to be monitored. To enable identification key logging for a pluggable database, you must also enable it for the CDB. When enabling identification key logging for individual tables, you don't need to enable it for the CDB. According to Oracle, best practice is to enable logging for individual tables rather than the entire CDB. For more information, see the Oracle LogMiner documentation.
For example, you can run the following command to apply the changes to just the container:
alter session set container=<pdb>;
Enable primary key or full supplemental identification key logging to retrieve data from redo logs. You do not need to enable both:
- To enable primary key logging
- You can enable primary key identification key logging for individual tables or all tables in the database:
- To enable full supplemental logging
- You can enable full supplemental identification key logging for individual tables or all tables in the database:
- Submit the changes.
For example:
alter system archive log current;
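The statements for the two identification key logging options above are not shown. As a sketch based on standard Oracle supplemental logging DDL (schema and table names are placeholders), they look like this:

```sql
-- Primary key logging for an individual table:
ALTER TABLE <schema>.<table> ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

-- Primary key logging for all tables in the database:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (PRIMARY KEY) COLUMNS;

-- Full supplemental logging for an individual table:
ALTER TABLE <schema>.<table> ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Full supplemental logging for all tables in the database:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
```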
Task 3. Create a User Account
Create a user account to use with the origin. You need the account to access the database through JDBC.
- 12c, 18c, 19c, or 21c CDB databases
- For Oracle, Oracle RAC, and Oracle Exadata 12c, 18c, 19c, and 21c CDB databases, create a common user account. Common user accounts are created in cdb$root and must use the naming convention c##<name>.
. - 12c, 18c, 19c, or 21c non-CDB databases
- For Oracle, Oracle RAC, and Oracle Exadata 12c, 18c, 19c, and 21c non-CDB
databases, create a user account with the necessary privileges:
- In a SQL shell, log into the database as a user with DBA privileges.
- Create the user account.
For example:
create user <user name> identified by <password>;
grant create session, alter session, set container, logmining, select any dictionary, select_catalog_role, execute_catalog_role to <user name>;
grant select on database_properties to <user name>;
grant select on all_objects to <user name>;
grant select on all_tables to <user name>;
grant select on all_tab_columns to <user name>;
grant select on all_tab_cols to <user name>;
grant select on all_constraints to <user name>;
grant select on all_cons_columns to <user name>;
grant select on v_$database to <user name>;
grant select on gv_$database to <user name>;
grant select on v_$instance to <user name>;
grant select on gv_$instance to <user name>;
grant select on v_$database_incarnation to <user name>;
grant select on gv_$database_incarnation to <user name>;
grant select on v_$log to <user name>;
grant select on gv_$log to <user name>;
grant select on v_$logfile to <user name>;
grant select on gv_$logfile to <user name>;
grant select on v_$archived_log to <user name>;
grant select on gv_$archived_log to <user name>;
grant select on v_$standby_log to <user name>;
grant select on gv_$standby_log to <user name>;
grant select on v_$logmnr_parameters to <user name>;
grant select on gv_$logmnr_parameters to <user name>;
grant select on v_$logmnr_logs to <user name>;
grant select on gv_$logmnr_logs to <user name>;
grant select on v_$logmnr_contents to <user name>;
grant select on gv_$logmnr_contents to <user name>;
grant select on v_$containers to <user name>;
grant select on <db>.<table> to <user name>;
Repeat the final command for each table that you want to monitor.
Task 4. Install Required Libraries on Data Collector
The Oracle CDC origin connects to Oracle through JDBC. You cannot access the database until you install the required Oracle JDBC libraries on Data Collector.
- ojdbc8.jar - Oracle JDBC driver. The latest driver version is recommended, though earlier versions supported by your database can also be used. When Data Collector runs on Java 11 or higher, you can alternatively use ojdbc11.jar. The drivers are also available in the ojdbc8-full.tar.gz and ojdbc11-full.tar.gz archives, respectively. Extract the library before installation. These drivers and archives are available on the Oracle driver downloads page.
- orai18n.jar - Oracle globalization library. Use the latest version of this library to ensure accurate processing of international character sets. This library is available in the ojdbc8-full.tar.gz archive. When Data Collector runs on Java 11 or higher, you can alternatively use the ojdbc11-full.tar.gz archive. The archives are available on the Oracle driver downloads page. Extract the library before installation.
- orai18n-mapping.jar - Oracle globalization library. Use the latest version of this library to ensure accurate processing of international character sets. This library is available within a JDBC Supplement Package (ZIP) through the Oracle Instant Client downloads page. Extract the library before installation.
Install the libraries as external libraries into the JDBC Oracle stage library, streamsets-datacollector-jdbc-oracle-lib, which includes the origin. For more information, see Install External Libraries in the Data Collector documentation.
Primary and Standby Databases
- Primary database
-
You can use the Oracle CDC origin to process changes from a primary database. When processing changes from a primary database, LogMiner can process changes directly from online redo logs or from archived redo logs.
- Standby database
- You can use the Oracle CDC origin to process changes from a standby database when archived redo logs are regularly replicated. Standby databases are physical or logical replica databases managed by Oracle Data Guard or other available Oracle tools. When processing changes from a standby database, LogMiner can only process changes from archived redo logs.
Table Filters
When you configure the Oracle CDC origin, you specify table filters that define the schemas and tables with the change capture data that you want to process.
The origin uses REGEXP_LIKE syntax to define a set of tables within a schema or across multiple schemas. For more information about REGEXP_LIKE syntax, see the Oracle documentation.
For example, say you want to process change data for the tables in the sales schema whose names start with SALES, excluding those that end with a dash (-) and a single-character suffix. You can use the following filter configuration to specify the tables to process:
- Schema: sales
- Table Name Pattern: ^SALES.*$
- Exclusion Pattern: ^SALES.*-.$
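To preview which tables such a filter would match, you can run the equivalent predicates against the data dictionary. This is a sketch; adjust the owner value to match how the schema name is stored in your database:

```sql
SELECT owner, table_name
FROM   all_tables
WHERE  REGEXP_LIKE(owner, '^sales$', 'i')
AND    REGEXP_LIKE(table_name, '^SALES.*$')
AND    NOT REGEXP_LIKE(table_name, '^SALES.*-.$');
```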
Start Mode
You indicate where the Oracle CDC origin should start processing by configuring the Start Mode property and related settings. You can configure the origin to start at the current change or the current instant in time, so that it processes the changes that occur after the pipeline starts. Or, you can specify a specific change or time to start processing.
The origin only uses the Start Mode and any specified initial change or time when you start the pipeline for the first time, or after you reset the origin. At other times, the origin uses its last saved offset to ensure that it processes all appropriate data.
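When you plan to specify an initial change, you can look up a starting system change number (SCN) directly in the database. These queries are a sketch and require select privileges on v$database; TIMESTAMP_TO_SCN maps a recent timestamp to an approximate SCN:

```sql
-- Current SCN, usable as an explicit starting point:
SELECT current_scn FROM v$database;

-- Approximate SCN for a known point in time (only works for
-- relatively recent timestamps still tracked by the database):
SELECT timestamp_to_scn(TO_TIMESTAMP('2024-01-15 00:00:00',
                                     'YYYY-MM-DD HH24:MI:SS')) AS scn
FROM   dual;
```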
Record Formats
- Basic
- Includes root-level record fields that represent the changes defined in the Oracle redo log, and record header attributes for additional metadata.
- Rich
- Has a record structure that allows the inclusion of the old values for changed data in the record, in addition to the new values. Also includes more metadata than the basic record format.
Custom Record Options
The Oracle CDC origin provides a wide range of properties that allow you to customize the data and metadata included in generated records. For example, you can include redo and undo statements, primary key and table definitions, and the precision and scale for numeric fields in record headers or in the record body, or you can omit the information from the record entirely.
When the origin parses redo statements and generates rich records, you can include information such as new and old field values, hexadecimal values, or raw values in record fields or omit the information from the record.
Define record customization on the Data tab of the origin. The default values for the properties generate a record similar to the Oracle CDC Client origin.
Parse SQL Statements, Supported Operations, and Record Attributes
The Oracle CDC origin generates records for the operation types that you specify. However, the origin reads all operation types from redo logs to ensure proper transaction handling.
- Parsing SQL statements
- When the origin parses redo SQL statements, you can
configure it to create records for the following operations:
- INSERT
- DELETE
- UPDATE
- Not parsing SQL statements
- When the origin does not parse redo SQL statements, it writes each SQL statement to a field. When the origin generates basic records, it writes SQL statements to an oracle.cdc.query field. When generating rich records, it writes SQL statements to a jdbc.cdc.oracle.redoStatement field. The origin also generates field attributes that provide additional information about this field.
CRUD Operation Header Attributes
- sdc.operation.type - The origin evaluates the operation type associated with each entry that it processes. When appropriate, it writes the operation type to the sdc.operation.type record header attribute.
- oracle.cdc.operation - The origin also writes the CRUD operation type to an oracle.cdc.operation attribute. This attribute was implemented in an earlier release and is supported for backward compatibility.
CDC Header Attributes
Basic Record CDC Attribute | Rich Record CDC Attribute | Description |
---|---|---|
oracle.cdc.sequence.oracle | jdbc.cdc.oracle.ordinalOracle | Sequence numbers that indicate the order in which statements were processed within the transaction. The sequence number is generated by Oracle. Use to keep statements in order within a single transaction. Not used at this time. |
oracle.cdc.sequence.internal | jdbc.cdc.oracle.ordinalCollector | Sequence numbers equivalent to those in the attribute above, but created by the origin. The sequence starts with 0 each time the pipeline starts. You can use these values to keep statements in order within a single transaction. |
Other Header Attributes
- Primary keys
-
When a table contains a primary key, the origin includes the following record header attribute:
jdbc.primaryKeySpecification - Provides a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.
For example, a table with a composite primary key contains the following attribute:
jdbc.primaryKeySpecification = {"<primary key column 1 name>": {"type": <type>, "datatype": "<data type>", "size": <size>, "precision": <precision>, "scale": <scale>, "signed": <Boolean>, "currency": <Boolean>}, ..., "<primary key column N name>": {"type": <type>, "datatype": "<data type>", "size": <size>, "precision": <precision>, "scale": <scale>, "signed": <Boolean>, "currency": <Boolean>}}
A table without a primary key contains the attribute with an empty value:
jdbc.primaryKeySpecification = {}
For an update operation on a table with a primary key, the origin includes the following record header attributes:
- jdbc.primaryKey.before.<primary key column name> - Provides the old value for the specified primary key column.
- jdbc.primaryKey.after.<primary key column name> - Provides the new value for the specified primary key column.
Note: The origin provides the new and old values of the primary key columns regardless of whether the value changes.
- Decimal fields
-
When a record includes decimal fields, the origin includes the following record header attributes for each decimal field in the record:
- jdbc.<fieldname>.precision
- jdbc.<fieldname>.scale
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
Default Field Attributes
The Oracle CDC origin generates field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information about each field.
- The Oracle Number data type is converted to the Data Collector Decimal data type, which does not store scale and precision.
- The Oracle Timestamp data type is converted to the Data Collector Datetime data type, which does not store nanoseconds.
Data Collector Data Type | Generated Field Attribute | Description |
---|---|---|
Decimal | precision | Provides the original precision for every number column. |
Decimal | scale | Provides the original scale for every number column. |
Datetime | nanoSeconds | Provides the original nanoseconds for every timestamp column. |
You can use the record:fieldAttribute or record:fieldAttributeOrDefault functions to access the information in the attributes. For more information about working with field attributes, see Field Attributes.
Annotation Field Attributes
You can configure the Oracle CDC origin to generate field attributes that provide additional information about each field in the record. These attributes can help you determine why a field contains a null value.
Annotation Attribute Name | Description |
---|---|
supported | Indicates if the original column type of the field is supported
by the origin. Possible values include:
|
presence | Indicates if the corresponding column for the field was included
in the redo SQL statement. Possible values include:
|
Multithreaded Parsing
By default, the Oracle CDC origin uses multiple threads to parse SQL statements. When performing multithreaded parsing, the origin uses multiple threads to generate records from committed SQL statements in a transaction. It does not perform multithreaded processing of the resulting records.
The Oracle CDC origin uses multiple threads for parsing based on the SQL Parser Threads property. When you start the pipeline, the origin creates the number of threads specified in the property. The origin connects to Oracle, creates a LogMiner session, and processes all transactions that contain changes for monitored tables.
When the origin processes a transaction, it reads and buffers all SQL statements in the transaction to an in-memory queue and waits for statements to be committed before processing them. Once committed, the SQL statements are parsed using all available threads and the original order of the SQL statements is retained.
The resulting records are passed to the rest of the pipeline. Note that enabling multithreaded parsing does not enable multithreaded processing; the pipeline uses a single thread for reading data.
Set the SQL Parser Threads property to 0 to use as many threads as there are available processing cores.
Query Fetch Size
The Oracle CDC origin provides a Query Fetch Size property that determines the maximum number of rows that the origin fetches at one time. When processing a large volume of data, using larger fetch sizes will improve performance by reducing the number of database hits.
As a best practice, set the query fetch size to as large a value as your execution environment allows. For example, you should set this property to at least 10,000 rows, though when possible, 25,000 rows or higher might be optimal.
When setting high fetch sizes, make sure that Data Collector has sufficient resources to handle the load. For information about configuring the Data Collector heap size, see Java Heap Size in the Data Collector documentation.
Uncommitted Transaction Handling
You can configure how the Oracle CDC origin handles long-running, stale, or abandoned uncommitted transactions. These are transactions with some operations on monitored tables but whose commit or rollback takes longer than expected to occur.
- Transaction Maximum Duration - The maximum time one transaction is expected to
last, from transaction start to commit. If a commit or rollback is not received
within the specified duration, the transaction is marked for eviction.
A moderately large setting for this property is recommended to allow for long-running batch processes.
- Operations Maximum Gap - The maximum amount of time that can pass between two
consecutive operations within the same transaction. If the specified time passes
after the last operation without the arrival of a new operation, control, or
rollback, the transaction is marked for eviction.
A relatively small value for this property is recommended to ensure that zombie transactions do not consume memory.
Evicted Transaction Handling
You can configure how the Oracle CDC origin handles evicted transactions and the operations that arrive for a transaction after it is evicted. An evicted transaction is an uncommitted transaction that has expired based on the configuration of the Transaction Maximum Duration or Operations Maximum Gap properties.
- Transaction Maximum Retention - The maximum time that the origin keeps an
internal reference to the evicted transaction. After the time elapses, the
internal reference is removed.
If the origin receives additional operations for an evicted transaction before the internal reference is removed, the origin silently discards these operations. If the origin receives additional operations for an evicted transaction after the internal reference is removed, the origin passes the operations to the pipeline when a commit is received, treating them as if they are from a new transaction.
A moderate amount of time is recommended for this property to prevent operations from being incorrectly passed to the pipeline, without allowing unnecessary references to evicted transactions to consume memory.
- Evicted Transactions - Defines how the origin handles operations that arrive for
an evicted transaction before the internal reference is removed:
- Send Transaction Operations to Pipeline - Sends operations to the pipeline like
normal records. Use to treat the operations as if the transaction was committed.
This option can create inconsistencies in your data. Use with caution.
- Discard Transaction Operations - Discards operations. Use to treat the operations as if the transaction was rolled back.
- Send Transaction Operations to Error - Sends operations to the stage for error handling. Use to treat the operations like error records.
- Abort Pipeline - Stops the pipeline. Use to review the error and address underlying issues.
Event Generation
The Oracle CDC origin can generate events that you can use in an event stream when the pipeline starts, when a LogMiner session starts, and when changes occur to monitored tables. You can use the events in any logical way. For example:
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
| Record Header Attribute | Description |
|---|---|
| sdc.event.type | Event type. Uses one of the types listed below. The names of some event types differ depending on the record format that the origin uses. |
| sdc.event.version | Integer that indicates the version of the event record type. |
| sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The Oracle CDC origin can generate the following types of event records:
- initial-database-state
- After the pipeline starts and the origin queries the database, the origin generates an event that describes the state of the database.
- logminer-session-start
- When the origin starts a LogMiner session, the origin generates an event.
- table loaded, created, changed, and removed
- When a monitored table has been loaded, created, changed, or removed, the origin generates an event. The event records have the `sdc.event.type` attribute set to different values depending on the record format that the origin uses:
  - Table loaded - `STARTUP` for basic records and `mined-table-loaded` for rich records.
  - Table created - `CREATE` for basic records and `mined-table-created` for rich records.
  - Table changed - `ALTER` for basic records and `mined-table-altered` for rich records.
  - Table removed - `DROP` for basic records and `mined-table-dropped` for rich records.
- table truncated
- When a monitored table has been truncated, the origin can generate an event. To generate truncate events, you must both enable the origin to generate events and configure the Table Truncate Handling property on the Oracle tab appropriately.
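When event generation is enabled, downstream stages can branch on the `sdc.event.type` header attribute. For example, a Stream Selector condition that matches LogMiner session start events might look like the following. `record:attribute` is standard Data Collector expression language; the attribute value is one of the event types listed above:

```
${record:attribute('sdc.event.type') == 'logminer-session-start'}
```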
Oracle Data Types
The Oracle CDC origin converts Oracle data types to Data Collector data types.
- Supported built-in data types
- The origin supports the following Oracle built-in data types:
| Built-In Data Type | Data Collector Data Type |
|---|---|
| Binary_Double | Double |
| Binary_Float | Float |
| Char | String |
| Date | Date |
| Float | Float |
| NChar, NVarchar | String |
| Number | Varies depending on the precision and scale. For details, see the numeric conversion table. |
| Timestamp | Datetime |
| Timestamp with Local Timezone, Timestamp with Timezone | Zoned_Datetime |
| Varchar, Varchar2 | String |
- Supported ANSI data types
- The origin supports the following ANSI data types:
| ANSI Data Type | Data Collector Data Type |
|---|---|
| Char Varying, Character, Character Varying | String |
| Decimal | Decimal |
| Double Precision | Double |
| Int, Integer, SmallInt | Varies depending on the precision and scale. For details, see the numeric conversion table. |
| National Char, National Char Varying, National Character, National Character Varying | String |
| NChar Varying | String |
| Numeric | Varies depending on the precision and scale. For details, see the numeric conversion table. |
| Real | Varies depending on the precision and scale. For details, see the numeric conversion table. |
- Numeric data type conversions
- Oracle numeric types are converted to Data Collector data types based on the precision and scale of the Oracle numeric type.
The following table describes how the types are converted:
| Precision | Scale | Data Collector Data Type |
|---|---|---|
| Not specified | Not specified | Decimal |
| Not specified | 0 | Long |
| Not specified | Greater than 0 | Decimal |
| Less than 3 | 0 | Byte |
| 3-4 | 0 | Short |
| 5-9 | 0 | Integer |
| 10 or greater | 0 | Long |
| Any specified value | Greater than 0 | Decimal |
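The conversion rules above can be expressed as a small lookup function. This is only an illustration of the table, not product code; it assumes, as Oracle does, that a NUMBER with a declared precision but no declared scale behaves like scale 0:

```python
def dc_type(precision=None, scale=None):
    """Map an Oracle NUMBER's precision and scale to the Data Collector
    data type, following the numeric conversion table above."""
    if precision is None:
        if scale is None:
            return "Decimal"                 # NUMBER with no precision or scale
        return "Long" if scale == 0 else "Decimal"
    if scale is not None and scale > 0:
        return "Decimal"                     # any precision with positive scale
    # Scale is 0 (or unspecified, which Oracle treats as 0): size by precision.
    if precision < 3:
        return "Byte"
    if precision <= 4:
        return "Short"
    if precision <= 9:
        return "Integer"
    return "Long"

print(dc_type())        # Decimal
print(dc_type(5, 0))    # Integer
print(dc_type(12, 0))   # Long
print(dc_type(10, 2))   # Decimal
```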
Unsupported Data Types
You can configure how the origin handles unsupported data types at the record or field level.
The Unsupported Column Data Types property defines how the origin handles records that contain unsupported data types. The property provides the following options:
- Pass the record to the pipeline - Passes the record to the pipeline. You can configure how the origin handles the fields that contain unsupported data types using the Unsupported Columns property. You can configure the origin to remove empty fields using the Null Columns property.
- Discard the record - Drops the record from the pipeline.
- Send record to error - Passes records to the stage for error handling without the unsupported data type fields.
- Abort pipeline - Stops the pipeline with an error. Use to review errors and address underlying issues.
The Unsupported Columns property defines how the origin handles fields that contain unsupported data types when the record passes to the pipeline:
- Keep Column with Raw Value - Includes the column as a record field with its original raw value.
- Keep Column with Null Value - Includes the column as a record field with an empty value.
- Drop Column - Omits the column from the record.
- Unsupported built-in data types
- The origin does not support the following built-in data types:
- Unsupported ANSI data type
- The origin does not support the following ANSI data type:
  - Long Varchar
- Unsupported data type categories
- The origin does not support the following categories of data types:
- Any
- Rowid
- Spatial
- Supplied
- User Defined
- XML Data
For example, this means that the following data types are not supported:
- DBUritype
- HTTPUritype
- Rowid
- SDO_Geometry
- SDO_Georaster
- SDO_Topo_Geometry
- URowid
- XDBUritype
- XMLType
Data Preview with Oracle CDC
When previewing data with the Oracle CDC origin, note the following differences from other origins:
- Preview generates a maximum of 10 records, ignoring the Preview Batch Size property when set to a higher value.
- Preview allows up to 5 seconds for the origin to receive records, timed from when the origin connects to Oracle. As a result, when previewing a pipeline with an Oracle CDC origin, preview ignores the Preview Timeout property when set to a higher value.
For other origins, the preview timeout period begins when you start preview and includes the connection time to external systems.
Working with the Drift Synchronization Solution
If you use the Oracle CDC origin as part of a Drift Synchronization Solution for Hive pipeline, make sure to pass only records flagged for Insert to the Hive Metadata processor. You can use either of the following methods:
- Configure the Oracle CDC origin to process only Insert records.
- If you want to process additional record types in the pipeline, use a Stream Selector processor to route only Insert records to the Hive Metadata processor.
Configuring an Oracle CDC Origin
Configure an Oracle CDC origin to use LogMiner to process change data capture information from an Oracle database. Before you use the origin, complete the prerequisite tasks.
-
In the Properties panel, on the General tab, configure
the following properties:
General Property Description Name Stage name. Description Optional description. Produce Events Generates event records when events occur. Use for event handling. On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the Connection tab, configure the following
properties:
Connection Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Connection Type Type of connection to use: - Thin Style - Service Name
- Thin Style - System Identifier
- Net Connection Descriptor
- Net Connection Alias
- Easy Connect
Host Host name or IP address of the database. Available for Thin Style connections.
Port Port number to use. Available for Thin Style connections.
Service Name Listener-registered service name. Available for Thin Style - Service Name connections.
System Identifier Unique system identifier for the database. Available for Thin Style - System Identifier connections.
TNS Connection Connection specification using a TNSNames pattern. Available for Net Connection Descriptor connections.
TNS Alias Connection alias, as defined in your local `tnsnames.ora` file. The `tnsnames.ora` file must be available in the expected Oracle location on the Data Collector machine, typically `${ORACLE_HOME}/network/admin`. Available for Net Connection Alias connections.
Connection URL URL for the connection. Use standard JDBC formatting. Available for Easy Connect connections.
Pluggable Database Name of the pluggable database that contains the schema that you want to use. Required when the schema was created in a pluggable database.
Use Credentials Enables specifying credentials.
Username User account to connect to the database. Use the user account created for the origin. Common user accounts for Oracle, Oracle RAC, and Oracle Exadata 12c, 18c, 19c, and 21c CDB databases start with `c##`. For more information, see Task 3. Create a User Account.
Password Password for the user account.
Connection Timeout Maximum milliseconds to wait for a connection. Use a time constant in an expression to define the time increment. Default is 30 seconds, defined as follows: `${30 * SECONDS * 1000}`.
Connection Properties Additional connection properties to use. To add properties, click Add and define the property name and value. Use the property names and values as expected by Oracle.
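For the Easy Connect connection type, the Connection URL follows standard JDBC thin-driver formatting. A typical URL looks like the following, where the host, port, and service name are placeholders for your environment:

```
jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1
```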
-
On the Oracle tab, configure the following
properties:
Data Property Description Tables Filter Specify inclusion and exclusion properties as needed to define the schemas and tables to monitor. You can use `REGEXP_LIKE` syntax to define the patterns:
- Schema Inclusion Pattern - Pattern describing the schema names to include in processing. Default is `^.*$`.
- Schema Exclusion Pattern - Pattern describing the schema names to exclude from processing. Default is `^$`.
- Table Inclusion Pattern - Pattern describing the table names to include in processing. Default is `^.*$`.
- Table Exclusion Pattern - Pattern describing the table names to exclude from processing. Default is `^$`.
Using simple or bulk edit mode, click the Add icon to define another table configuration.
Start Mode Starting point for processing. When you start the pipeline for the first time, the origin starts processing from the specified initial change or time. The origin only uses the specified initial change or time again when you reset the origin. Use one of the following options:
- From Current Change/Instant - Processes changes that arrive after you start the pipeline.
- From Specified Change - Processes changes starting from the specified system change number (SCN).
- From Specified Instant - Processes changes starting
from the specified date.
You cannot use this mode when your database honors daylight saving time.
Initial System Change Number System change number (SCN) to start reading from when you start the pipeline. If the SCN cannot be found in the redo logs, the origin continues reading from the next higher SCN that is available in the redo logs. Available when Start Mode is set to From Specified Change.
Initial Instant Datetime to read from when you start the pipeline. For a date-based initial change. Use the following format: YYYY-MM-DD HH24:MM:SS
You cannot use this mode when your database honors daylight saving time.
Available when Start Mode is set to From Specified Instant.
Window Mode Method used to advance the LogMiner window: - Amount of Changes
- Amount of Time
Window Length (changes) Number of changes to use as the LogMiner window length. Available when using the Amount of Changes window mode.
Window Length (time) Amount of time to use as the LogMiner window length, in milliseconds. Available when using the Amount of Time window mode.
Transaction Maximum Duration (ms) Maximum duration to allow from the transaction start time until the transaction is evicted, in milliseconds. Affects how the origin handles uncommitted transactions.
Set to -1 for no limit.
Transaction Maximum Retention (ms) Maximum time to retain an evicted transaction, in milliseconds. Affects how the origin handles evicted transactions.
Set to -1 for no limit.
Operations Maximum Gap (ms) Maximum time to allow between consecutive operations before a transaction is evicted, in milliseconds. Affects how the origin handles uncommitted transactions.
Set to -1 for no limit.
Maximum Wait for Archive Logs (ms) Maximum time to wait for archive logs.
Available when Database Role is set to Standby.
Database Role Role of the database: primary or standby. SQL Operations SQL operations to generate records for. Table Truncate Handling Method for handling Truncate statements in redo logs: - As Record - Generates a record, like a regular operation.
- As Event - Generates an event if the origin is configured to generate events.
- As Record and Event - Generates a record. Also generates an event if the origin is configured to generate events.
- Ignore - Ignores Truncate statements.
When using the Basic record format, As Event is the only valid option.
Default is As Event.
Case Sensitive Names Submits case-sensitive schema, table, and column names. When not selected, the origin submits names in all caps. Oracle uses all caps for schema, table, and column names by default. Names can be lower- or mixed-case only if the schema, table, or column was created with quotation marks around the name.
Track LogMiner Sessions Includes LogMiner session details and events in Data Collector log files. When the origin is configured to generate events, the origin also generates a logminer-session-start event each time a new LogMiner session starts.
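The Tables Filter patterns behave as anchored regular expressions: a schema or table is monitored when it matches an inclusion pattern and does not match the corresponding exclusion pattern. As an illustration, using Python's `re` module as a stand-in for Oracle's `REGEXP_LIKE` evaluation, with example (not default) pattern values:

```python
import re

# Hypothetical filter configuration: monitor all tables in the SALES
# schema except audit tables. These pattern values are examples only.
schema_include = r"^SALES$"
table_exclude = r"^AUDIT_.*$"

def monitored(schema, table):
    """Return True if the schema/table pair passes the filters."""
    return (re.fullmatch(schema_include, schema) is not None
            and re.fullmatch(table_exclude, table) is None)

print(monitored("SALES", "ORDERS"))      # True
print(monitored("SALES", "AUDIT_LOG"))   # False
print(monitored("HR", "ORDERS"))         # False
```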
-
On the Data tab, you can configure the following options.
The default values produce a record similar to that of the older Oracle CDC Client origin. For more information, see Comparing Oracle CDC Origins and Updating Oracle CDC Client Pipelines.
Options Property Description Record Format Record format to use. - Basic - Includes data and metadata about processed
changes.
Provides a record similar to the Oracle CDC Client origin that can be written using CRUD-enabled destinations.
- Rich - Includes new and old values for data and
additional metadata about processed changes.
Use when your pipeline logic requires both old and new values for changed data.
Default is Basic.
Evicted Transactions Behavior for evicted transactions: - Send Transaction Operations to Pipeline - Sends operations to the pipeline like
normal records. Use to treat the operations as if the transaction was committed.
This option can create inconsistencies in your data. Use with caution.
- Discard Transaction Operations - Discards operations. Use to treat the operations as if the transaction was rolled back.
- Send Transaction Operations to Error - Sends operations to the stage for error handling. Use to treat the operations like error records.
- Abort Pipeline - Stops the pipeline. Use to review the error and address underlying issues.
Default is Send Transaction Operations to Error.
Available when Parse Statements is enabled.
Parse Statements Parses redo SQL statements and places resulting values in records. This property is enabled by default.
Unreadable Column Values Behavior when column values cannot be parsed: - Send Record to Pipeline - Includes the field names for unreadable columns in records, with no data. You can configure the origin to remove empty fields using the Null Columns property.
- Discard Record - Drops the record from the pipeline.
- Send Record to Error - Passes records to the stage for error handling without the unreadable columns.
- Abort Pipeline - Stops the pipeline with an error. Use to review errors and address underlying issues.
Default is Send Record to Error.
Available when Parse Statements is enabled.
Unsupported Column Data Types Behavior for columns with unsupported data types: - Pass the record to the pipeline - Passes the record to the pipeline. You can configure how the origin handles the fields that contain unsupported data types using the Unsupported Columns property. You can configure the origin to remove empty fields using the Null Columns property.
- Discard the record - Drops the record from the pipeline.
- Send record to error - Passes records to the stage for error handling without the unsupported data type fields.
- Abort pipeline - Stops the pipeline with an error. Use to review errors and address underlying issues.
Default is Send Record to Error.
Available when Parse Statements is enabled.
Missing Columns Behavior when columns are missing from the redo SQL statement: - Send Empty Column - Includes empty fields in records.
- Discard Column - Omits fields from records.
Default is Discard Column.
Available when Parse Statements is enabled.
Null Columns Behavior when fields include null or missing values: - Send Empty Column - Includes empty fields in records.
- Discard Column - Omits fields from records.
Default is Discard Column.
Available when Parse Statements is enabled.
Unsupported Columns Action to take on columns that contain unsupported data types: - Keep Column with Raw Value - Includes the column as a record field with its original raw value.
- Keep Column with Null Value - Includes the column as a record field with an empty value.
- Drop Column - Omits the column from the record.
Default is Keep Column with Raw Value.
Available when Parse Statements is enabled.
Only used when Unsupported Column Data Types is set to Pass the record to the pipeline.
Annotate Column Availability Includes additional attributes about column availability and type support. This property is not selected by default.
Available when Parse Statements is enabled.
Change Control Values Location of change control value information, such as the operation type, in records: - In Body
- In Header - Use to allow CRUD-enabled destinations to write records based on the operation type.
- Omit
Default is In Header.
Redo Statement Location of the redo statement in records: - In Body
- In Header
- Omit
Default is In Header.
Undo Statement Location of the undo statement in records: - In Body
- In Header
- Omit
Default is In Header.
Table Definition Location of table definitions in records: - In Body - Valid only when Record Format is set to Rich.
- In Header
- Omit
Default is Omit.
Primary Key Definition Location of primary key definitions in records: - In Body
- In Header
- Omit
Default is In Header.
Numeric Columns Digit Structure Location of precision and scale for numeric fields in records: - In Header and Body
- In Body
- In Header
- Omit
Default is In Header and Body.
New Values Location of new values in records: - In Body - Use to include changed and new values in records.
- Omit
Default is In Body.
Available when Parse Statements is enabled.
Old Values Location of old values in records: - In Body - Valid only when Record Format is set to Rich.
- Omit
Default is Omit.
Available when Parse Statements is enabled.
New Hexadecimal Values Location of hexadecimal values of all new or changed data in records: - In Body - Valid only when Record Format is set to Rich.
- Omit
Available when Parse Statements is enabled.
Old Hexadecimal Values Location of old hexadecimal values in records: - In Body - Valid only when Record Format is set to Rich.
- Omit
Available when Parse Statements is enabled.
New Raw Values Location of raw values of all new or changed data in records: - In Body - Valid only when Record Format is set to Rich.
- Omit
Available when Parse Statements is enabled.
Old Raw Values Location of old raw values in records: - In Body - Valid only when Record Format is set to Rich.
- Omit
Available when Parse Statements is enabled.
Primary Key New Values Location of new primary key values in records: - In Body
- In Header
- Omit
Default is In Header.
Available when Parse Statements is enabled.
Primary Key Old Values Location of old primary key values in records: - In Body - Valid only when Record Format is set to Rich.
- In Header
- Omit
Available when Parse Statements is enabled.
Oracle's Pseudo Columns New Values Location of new values for pseudocolumns in records: - In Extended Body - New pseudocolumn values are included in a `/newPseudocolumnValues` field. Valid only when Record Format is set to Rich.
- In Body - New pseudocolumn values are included in the `/newValues` field. Valid only when Record Format is set to Rich.
- In Header
- Omit
Default is In Header.
Available when Parse Statements is enabled.
Oracle's Pseudo Columns Old Values Location of old values for pseudocolumns in records: - In Extended Body - Old pseudocolumn values are included in a `/oldPseudocolumnValues` field. Valid only when Record Format is set to Rich.
- In Body - Old pseudocolumn values are included in the `/oldValues` field. Valid only when Record Format is set to Rich.
- In Header - Valid only when Record Format is set to Rich.
- Omit
Default is Omit.
Available when Parse Statements is enabled.
-
On the Resilience tab, configure the following properties,
as needed:
Resilience Property Description Database Reconnect Interval (ms) Milliseconds to wait between trying to reconnect to Oracle.
Database Reconnect Tries Maximum number of times to try to connect to Oracle.
LogMiner Session Restart Interval (ms) Milliseconds to wait between trying to create a LogMiner session.
LogMiner Session Restart Tries Maximum number of times to try to create a LogMiner session.
Wait Time Before Session Start (ms) Milliseconds to wait before starting a LogMiner session.
Wait Time After Session Start (ms) Milliseconds to wait before sending a query after a LogMiner session starts.
Wait Time After Session End (ms) Milliseconds to wait after a LogMiner session ends before starting a new LogMiner session.
Orphan Operations Event Scan Interval (ms) Milliseconds to wait between scans when the origin detects a transaction with data that is not yet persisted to the redo log. Default is ${5 * SECONDS * 1000}.
Orphan Operations Event Scan Tries Maximum number of times to rescan when the origin detects a transaction with data that is not yet persisted to the redo log. Default is 12.
Minimum Lag from Database Time (sec) Minimum number of seconds between the time an operation occurs and the database time, before the origin processes the operation.
Default is ${1 * SECONDS}.
-
Optionally, on the Tuning tab, configure properties to
tune the performance of the origin:
Tuning Property Description Max Batch Size (records) Maximum number of records processed at one time. Honors values up to the Data Collector maximum batch size. Default is 1000. The Data Collector maximum batch size also defaults to 1000.
Max Batch Wait Time (ms) Number of milliseconds to wait before sending a partial or empty batch. Specify a value greater than 0, or set to -1 for no limit.
Max Batch Vault Size (batches) Maximum number of batches the origin pre-generates while the pipeline is processing other batches. Limiting the number of pre-generated batches can reduce the amount of memory used by the origin. When the Data Collector machine has severe memory constraints, StreamSets recommends setting the value to 1.
Default is 64.
Archived Logs Availability Check Frequency (ms) Frequency the origin checks for archived logs, in milliseconds. Used for standby databases only.
Eviction Check Frequency (ms) Frequency the origin checks for evictions in transactions, in milliseconds. Buffer Size Size of the internal buffer queues. This number correlates to capacity, not to memory size. Specify an integer that is a power of 2.
Default is 8192.
Buffer Wait Strategy Strategy to use when waiting for available buffers: - Blocking - Based on thread lock and wake-up. This is the slowest strategy, but the most conservative and consistent for the CPU.
- Sleeping - Based on busy wait loops, using thread pause. This is a slightly faster strategy that is also CPU conservative, but with lower latency.
- Yielding - Based on thread yield. This is a high-performing strategy aiming for low latencies. Recommended when hyper-threading is enabled.
- Busy Spin - Based on busy spin. This is the highest performing strategy that works better with Java 11 or higher. Recommended when hyper-threading is disabled.
LogMiner Contents Query Strategy Strategy used to query LogMiner. Use the table name or object ID. Note: Due to the way the origin interacts with LogMiner, the table name strategy uses the following format instead of the actual table name: `OBJ# <table object id>`.
.Query Fetch Size Number of rows to fetch with each query. Specify a larger number to reduce the number of times the origin queries the database. Larger query fetch sizes can require more memory.
Query Timeout (ms) Maximum milliseconds to wait for a query to complete. Set to -1 for no limit. Default is 10 minutes.
SQL Parser Threads Maximum number of threads to use to parse SQL in parallel. Set to 0 for no limit. Default is 32.
Summary Update Frequency (ms) Frequency to update information when you monitor the pipeline.
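Because the Buffer Size property must be a power of 2, a desired capacity that is not one must be rounded up before it is entered. A helper like the following (illustrative only, not part of the product) shows the rounding:

```python
def next_power_of_two(n: int) -> int:
    """Round n up to the nearest power of 2 (minimum 1)."""
    if n < 1:
        return 1
    # (n - 1).bit_length() gives the exponent of the smallest power of 2 >= n.
    return 1 << (n - 1).bit_length()

print(next_power_of_two(6000))   # 8192, the default Buffer Size
print(next_power_of_two(8192))   # 8192, already a power of 2
```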