Release Notes
5.6.x Release Notes
- 5.6.2 on September 14, 2023
- 5.6.1 on August 22, 2023
- 5.6.0 on June 26, 2023
New Features and Enhancements
- New stages
- Aerospike Client destination - The new destination writes data to Aerospike.
- Kaitai Struct Parser processor - The new processor parses binary data using a Kaitai Struct format description.
- Snowflake Bulk origin - The new origin reads the available data from multiple Snowflake tables or views and then stops the pipeline. The origin can use multiple threads to perform parallel processing.
- Stage enhancements
- Amazon S3, Azure Blob Storage, Directory, and Google Cloud Storage origins - These origins now support the binary data type.
- Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Hadoop FS, Local FS, and MapR FS destinations - In the Compression Codec property, the Snappy option has been changed to Snappy (Airlift) to indicate that the destination uses the Airlift version of Snappy rather than the standard Snappy version.
- Azure Synapse SQL destination - On the Azure Synapse SQL tab, you can define property names and values in Additional Connection Properties to specify standard driver and Hikari properties.
- Databricks Delta Lake destination and Databricks Query executor - These stages are no longer part of an Enterprise stage library. These stages are now available in the Databricks stage library, streamsets-datacollector-sdc-databricks-lib. The Databricks stage library requires the scheme jdbc:databricks rather than jdbc:spark in the URL or connection string. This change impacts upgrades.
- Databricks Delta Lake destination:
- When processing CDC data, the destination can now use the primary key information from record header attributes. In the new Primary Key Location property, you configure where the destination finds the primary key, either in the header attributes or in the stage configuration for each table.
- The Key Columns property has been renamed Table Key Columns. The property is available if you set the Primary Key Location property to Specify for each table.
- With relaxed requirements for table names, the destination supports writing data to Delta Lake tables managed by Unity Catalog.
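The jdbc:databricks scheme change noted above for the Databricks stage library means an upgraded connection string needs only a scheme swap. A hedged illustration, in which the hostname, port, and HTTP path are hypothetical placeholders and the remaining parameters are only representative:

```
# Before (Databricks Enterprise stage library):
jdbc:spark://<server-hostname>:443/default;transportMode=http;httpPath=<http-path>

# After (Databricks stage library, 5.6.0 and later):
jdbc:databricks://<server-hostname>:443/default;transportMode=http;httpPath=<http-path>
```

Only the scheme changes; the connection parameters that follow it carry over unchanged.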
- HTTP Client processor:
- Records that are not processed before the batch wait time expires are sent for error processing rather than discarded. If One Request per Batch is enabled and the records are not processed before the batch wait time expires, all the records in the batch are sent for error processing.
- In the new Compression Library property, you can specify the compression library used to decompress files before reading. By default, the processor independently detects the compression library for each file. For the processor to read files compressed with the Airlift version of Snappy, you must select Snappy (Airlift Snappy) in the Compression Library property. For the processor to read files compressed with the standard version of Snappy, select the default option to detect the compression library automatically.
- MongoDB Atlas origin - The Initial Offset property no longer has a default value.
- Oracle Bulkload origin - The origin is no longer part of an Enterprise stage library. The origin is included in the JDBC Oracle stage library, streamsets-datacollector-jdbc-oracle-lib. This change impacts upgrades.
- Oracle CDC and Oracle CDC Client origins - The origins contain a new property, Fetch Strategy, that sets the method for staging LogMiner results. Staging to a disk-based queue can alleviate memory issues.
- Orchestration stages - You can configure the Start Jobs origin, Control Hub API processor, Start Jobs processor, or Wait for Jobs processor to use a Control Hub connection.
- Origins with a Compression Format property - For compressed formats, you can specify the compression library that the origin uses to decompress files in the new Compression Library property. By default, the origin independently detects the compression library for each file. For origins to read files compressed with the Airlift version of Snappy, including files from destinations, you must select Snappy (Airlift Snappy) in the Compression Library property. For origins to read files compressed with the standard version of Snappy, select the default option to detect the compression library automatically. This change impacts upgrades.
- Snowflake stages - All Snowflake stages, including the Snowflake destination, the Snowflake File Uploader destination, and the Snowflake executor, have improved connection and authentication options:
- To connect to a virtual private Snowflake installation, you have two options. You can configure the stages to compute the virtual private URL automatically from the values in the Account property and either the Snowflake Region or Organization property. Alternatively, you can enter a custom JDBC URL.
- For authentication, the stages support OAuth and key pairs as alternatives to user credentials. To use other authentication methods, you can enter custom connection properties.
- Snowflake destination:
- The destination supports arrays. On the Data Advanced tab, you can use the ARRAY Default property to configure the default value for missing or incorrect array values.
- For Amazon S3 external stages, you can configure the S3 Tags property to specify a list of tags to add to created objects.
- On the Snowflake tab, the Error Behavior, Skip File On Error, Max Error Records, and Max Error Record Percentage properties have been moved to the bottom of the tab.
- SQL Parser processor - For tables with a primary key, the processor includes the following new record header attributes to track changes in the primary key:
- jdbc.primaryKeySpecification - Includes a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.
- jdbc.primaryKey.before.<primary key column> - Includes the previous value for the specified primary key column.
- jdbc.primaryKey.after.<primary key column> - Includes the new value for the specified primary key column.
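As an illustration of the attributes above, an update that changes a primary key value might carry header attributes such as the following. The table column name CUSTOMER_ID and the values are hypothetical, and the contents of the specification JSON are elided:

```
jdbc.primaryKeySpecification = {"CUSTOMER_ID": {...}}
jdbc.primaryKey.before.CUSTOMER_ID = 1001
jdbc.primaryKey.after.CUSTOMER_ID = 2001
```

Downstream stages can compare the before and after attributes to detect primary key changes.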
- SQL Server CDC Client origin - The Combine Update Records property has been replaced by the Record Format attribute. You can choose between three options to configure how the origin generates records:
- Basic - Generates two records for updates, one with the old data and one with the changed data. This option produces the same result as if Combine Update Records were set to false.
- Basic discarding ‘Before Update’ records - Generates one record for updates, containing the changed data.
- Rich - Generates one record for updates with data written to the Data field, OldData field, or both. This option produces the same result as if Combine Update Records were set to true.
The origin generates a record header attribute named record_format, which indicates the format of the generated record: 1 indicates basic format, 2 indicates basic discarding "before update" records, and 3 indicates rich.
- Stage libraries - This release includes the following new stage libraries:
- streamsets-datacollector-aerospike-client-lib - For Aerospike 6.x.
- streamsets-datacollector-apache-kafka_3_4-lib - For Kafka version 3.4.x.
- streamsets-datacollector-cdp_7_1_8-lib - For Cloudera CDP 7.1.8.
- streamsets-datacollector-kaitai-lib - For Kaitai Struct.
- streamsets-datacollector-sdc-databricks-lib - For Databricks.
- Connections
- Aerospike connection - You can use this new Control Hub connection with the Aerospike Client destination.
- Databricks Delta Lake connection - The connection requires the Databricks stage library, streamsets-datacollector-sdc-databricks-lib. The Databricks stage library requires the scheme jdbc:databricks rather than jdbc:spark in the URL. This change impacts upgrades.
- Orchestrator connection - You can use this new Control Hub connection with the Start Jobs origin, Control Hub API processor, Start Jobs processor, or Wait for Jobs processor.
- Snowflake connection:
- You can use the Snowflake connection with the new Snowflake Bulk origin.
- The connection has improved connection and authentication options, described above for the Snowflake stages under “Stage enhancements.”
- Tests of the Snowflake connection properly use every configuration set under Connection Properties.
- Additional enhancements
- Microsoft Azure Key Vault credential store - Data Collector can use managed identities in addition to client keys to authenticate with Azure Key Vault.
- The sdc.properties file contains a new property. When the property is enabled, Data Collector provides an HTTP Strict-Transport-Security (HSTS) response header. To enable the new property, you must configure the https.port property.
- Time functions - The StreamSets expression language includes the following new time functions:
- time:nowNumber - Creates a Date object set to the current date and time with millisecond precision.
- time:nowNanoInstant - Creates a LocalDateTime object set to the current date and time with nanosecond precision.
- time:nowNanoZonedInstant - Creates a ZonedDateTime object set to the current date and time with nanosecond precision.
- time:nowNanoTimestampNumber - Creates a Number object set to the current time, specified as epoch time, with nanosecond precision.
- time:nowNanoTimestampString - Creates a String object set to the current time, specified as epoch time, with nanosecond precision.
- Timestamp functions - The StreamSets expression language introduces a new type of function that offers improved performance over time functions. You can replace a time function with the corresponding timestamp function for better performance. StreamSets recommends using timestamp functions for all new development. The expression language includes the following timestamp functions:
- timestamp:nowDate - Creates a Date object set to the current date and time with millisecond precision. Use as an alternative to the time:now function.
- timestamp:nowLocal - Creates a LocalDateTime object set to the current date and time with nanosecond precision. Use as an alternative to the time:now function.
- timestamp:nowMilliseconds - Creates a Long object set to the current date and time with millisecond precision. Use as an alternative to the time:nowNumber function.
- timestamp:nowMillisecondsString - Creates a String object set to the current date and time with millisecond precision. Use as an alternative to the time:nowNanoTimestampString function.
- timestamp:nowNanoseconds - Creates a Double object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoTimestampNumber function.
- timestamp:nowNanosecondsString - Creates a String object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoTimestampString function.
- timestamp:nowZoned - Creates a ZonedDateTime object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoZonedInstant function.
- timestamp:extractStringFromDate - Converts a Date object into a String object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractStringFromDate function.
- timestamp:extractStringFromDateAndZone - Converts a Date object into a String object, based on a specified date-time format and a time zone, with nanosecond precision. Use as an alternative to the time:extractStringFromDateTZ function.
- timestamp:extractLongFromDate - Converts a Date object into a Long object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractLongFromDate function.
- timestamp:createDateFromString - Converts a String object into a Date object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractDateFromString function.
- timestamp:createDateFromStringAndZone - Converts a String object into a Date object, based on a specified date-time format and a time zone, with nanosecond precision. Use as an alternative to the time:createDateFromStringTZ function.
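For example, replacing a time function with its timestamp counterpart is a direct substitution in an expression. In this sketch, the field name /created and the format pattern are hypothetical, and the extract function is assumed to take the same argument order as the time function it replaces:

```
# Before: millisecond-precision epoch time
${time:nowNumber()}

# After: the recommended timestamp equivalent
${timestamp:nowMilliseconds()}

# Formatting a Date field as a string
${timestamp:extractStringFromDate(record:value('/created'), 'yyyy-MM-dd HH:mm:ss')}
```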
- Data lineage publication - For each job run, Data Collector can publish data lineage information in a JSON file. You can export and use the file in your data governance solution. To enable data lineage publication, you must add the following lines to the sdc.properties file:
lineage.publishers=json
lineage.publisher.json.def=streamsets-datacollector-basic-lib::com_streamsets_pipeline_lib_lineage_JSONLineagePublisher
lineage.publisher.json.config.outputDir=<directory for files>
You can optionally include the following line:
lineage.publisher.json.config.saveInterval=<time in milliseconds>
Data Collector publishes the data lineage information at the specified interval. By default, the interval is set to 60,000 milliseconds. If you specify a negative interval, Data Collector only publishes the information when the pipeline finishes.
Data Collector publishes data lineage information for the following stages:
- Databricks Job Launcher executor
- Dev Data Generator origin
- Google Big Query origin and destination
- Google Cloud Storage origin and destination
- Google Pub Sub Publisher destination
- HBase destination
- HTTP Client origin
- JDBC Multitable Consumer origin
- JDBC Query Consumer origin
- Kinesis Consumer origin
- Kinesis Producer destination
The following stages include field-level data:
- JDBC Multitable Consumer origin
- JDBC Query Consumer origin
- Dev Data Generator origin
- HBase destination
Upgrade Impact
- Review bucket properties in Amazon S3 stages
- Starting with version 5.6.0, you can no longer include the forward slash (/) in the following properties for Amazon S3 stages due to an Amazon Web Services SDK upgrade:
- Bucket property for the Amazon S3 origin
- Bucket and path property for the Amazon S3 destination and executor
- Install the Databricks stage library to use the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection
- Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.
- Review the JDBC URL in the Databricks Delta Lake destination and the JDBC Connection String in the Databricks Query executor
- Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.
- Update the JDBC URL in the Databricks Delta Lake connection
- Starting with version 5.6.0, the scheme of the URL is jdbc:databricks rather than jdbc:spark.
- Install the Oracle stage library to use the Oracle Bulkload origin
- Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.
- Grant users view access for the Oracle CDC origin
- Starting with version 5.6.0, the Oracle CDC origin must use a user account with access to the v$containers view.
- Update origins and processors that read files compressed with the Airlift version of Snappy
- Starting with version 5.6.0, origins that read compressed files require you to set the Compression Library property to properly read files compressed with the Airlift version of Snappy. Destinations compress files with the Airlift version of Snappy. This affects the HTTP Client processor and the following origins:
- Amazon S3
- Azure Blob Storage
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2 (Legacy)
- Azure IoT/Event Hub Consumer
- CoAP Server
- Directory
- File Tail
- Hadoop FS Standalone
- Google Cloud Storage
- Google Pub/Sub Subscriber
- gRPC Client
- HTTP Client
- HTTP Server
- Kafka Multitopic Consumer
- MQTT Subscriber
- REST Service
- SFTP/FTP/FTPS Client
- TCP Server
- WebSocket Client
- WebSocket Server
5.6.2 Fixed Issues
- The Snowflake and Snowflake File Uploader destinations and the Snowflake executor do not release threads after a pipeline stops.
- Orchestration stages do not retry requests after the requests time out.
5.6.1 Fixed Issues
- The Azure Synapse SQL, Databricks Delta Lake, Google BigQuery, and Snowflake destinations might generate invalid date, datetime, time, or zoned datetime formats when configured to write data to multiple tables and the Connection Pool Size is set to 0 or to a value greater than 1.
5.6.0 Fixed Issues
- The Snowflake destination does not consolidate records properly when using CDC.
- The Oracle CDC Client origin does not generate batches if only error records are generated.
- To avoid data loss when processing LogMiner records, the Oracle CDC Client origin must store all records from a time window before processing them, which consumes excessive memory in many cases.
- The Oracle CDC origin mixes operations of target pluggable databases and unrelated pluggable databases. This fix impacts upgrades.
- The time:dateAddition and time:dateDifference functions fail to validate because they expect datetime values in the LocalDateTime date format, which the StreamSets expression language cannot interpret.
- In the HTTP Client stages and the Control Hub API processor, setting the Log Level property to a level higher than Info results in no messages written to the log.
- The Control Hub API processor does not correctly log request and response data.
- Using the JDBC Multitable Consumer origin to read from an Oracle database with multiple threads results in cursor leaks.
- Previewing the Azure Data Lake Storage Gen2 origin results in a timeout and no records generated.
- The Local FS destination stops processing files upon encountering a file name with a colon.
- The SFTP/FTP/FTPS Client destination cannot write files if configured to use the SFTP protocol to connect to a remote server with a storage layer in a Google Cloud Services bucket.
- The MySQL Binary Log origin cannot process queries for table or database names with hyphens.
- Control Hub randomly provides users and engines with 401 authorization errors.
5.6.x Known Issues
- A Databricks Delta Lake destination staged on ADLS Gen2 and using OAuth 2.0 authentication can fail with a DATA_LOADING_10 error. This can occur because the destination needs to perform the Get Blob Service Properties operation, and the OAuth 2.0 account does not have the appropriate permissions.
Workaround: Grant the OAuth 2.0 account the necessary permissions to perform the Get Blob Service Properties operation. For more information, see the Microsoft documentation.
5.5.x Release Notes
The Data Collector 5.5.0 release occurred on April 28, 2023.
New Features and Enhancements
- New stages
- Azure Data Lake Storage Gen2 origin - The new Azure Data Lake Storage Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2. The new origin connects to Azure Data Lake Storage Gen2 through the API, which results in improved performance.
The existing Azure Data Lake Storage Gen2 origin has been renamed Azure Data Lake Storage Gen2 (Legacy). Use the new origin for all new development.
- Azure Blob Storage origin - The new Azure Blob Storage origin reads data from Microsoft Azure Blob Storage.
- SingleStore destination - The new SingleStore destination writes data to a SingleStore database table.
- Stage enhancements
- Azure Data Lake Storage Gen2 stages - These stages support Azure Managed Identity authentication.
- Azure Synapse SQL destination:
- The stage is no longer part of an Enterprise stage library. The stage is part of the Azure stage library, streamsets-datacollector-azure-lib. This change has an upgrade impact.
- With the new Propagate Numeric Precision and Scale property enabled, the destination creates new numeric columns with the precision and scale specified in JDBC record header attributes. With the property disabled or when JDBC header attributes are not available, the stage creates new numeric columns with the default numeric definition.
- Google Big Query destination - The destination can stage data in JSON files.
- Hive Metadata processor, Hive Metastore destination, and Hive Query executor - In these stages, the Additional Hadoop Configuration property supports credential values.
- MongoDB Atlas origin and destination - For pipelines with a MongoDB Atlas origin or destination, validation passes if at least one host specified in the connection string is reachable. Validation fails if all of the hosts specified in the connection string are unreachable.
- MongoDB Atlas destination - The destination can update or upsert a map, list, or list-map as a nested or non-nested MongoDB document, depending on the number of unique keys.
- OPC UA Client origin:
- The new Max Recursion Depth property defines the maximum depth to browse for recursive processing.
- The Channel Config tab has been renamed Encoding Limits.
- The Max Array Length and Max String Length properties have been removed because they are redundant. The existing Max Message Size property limits the size of the message. This change has an upgrade impact.
- PostgreSQL CDC Client and Aurora PostgreSQL CDC Client origins - The origins interpret the values in the Poll Interval property as milliseconds.
- Snowflake File Uploader destination - For the file-closure event, the destination writes the file name and path information to two fields: Filename and Filepath.
- SQL Server CDC Client origin - The origin generates the primary key record header attributes regardless of whether the Combine Update Records property is enabled.
- Connections
- Azure Blob Storage - This new Control Hub connection is included for use with the new Azure Blob Storage origin.
- Azure Data Lake Storage Gen2:
- You can use the connection with the new Azure Data Lake Storage Gen 2 origin.
- The connection now supports Azure Managed Identity authentication.
- Azure Synapse - Because the Azure Synapse SQL destination is no longer part of an Enterprise stage library, you must install the Azure stage library, streamsets-datacollector-azure-lib, to configure or use Azure Synapse connections. This change has an upgrade impact.
- New support
- Cloudera Manager - To support Data Collector installation with Cloudera Manager, StreamSets now provides a Cloudera parcel for RHEL 8.
- Additional enhancements
- Runtime properties - The runtime.conf.location property supports both a relative and an absolute path. When configuring a separate runtime properties file, specify a relative path for a file inside the Data Collector installation directory or specify an absolute path for a file outside the Data Collector installation directory.
- Error information - A new Error Information Level property sets the amount of error information included in email. You configure the property for notifications of changes in pipeline state and for notifications triggered by rules.
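A minimal sdc.properties sketch of the runtime.conf.location behavior described above; the file names and paths are hypothetical:

```properties
# Relative path: resolved inside the Data Collector installation directory
runtime.conf.location=runtime.properties

# Absolute path: points to a file outside the installation directory
# runtime.conf.location=/etc/sdc/runtime.properties
```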
- Health Inspector - The Health Inspector page includes information about the operating system and version, process uptime, engine time zone, external IP address, and ping and traceroute attempts to Control Hub.
- Time functions - The StreamSets expression language includes two new time functions to operate on dates:
- time:dateAddition - Adds an interval to a date.
- time:dateDifference - Determines the interval between two dates.
Upgrade Impact
- Install the Azure stage library to use the Azure Synapse SQL destination and connection
- Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.
- Review pipelines with Salesforce stages that import date values
- Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.
- Review the maximum message size for OPC UA Client pipelines
- Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.
5.5.0 Fixed Issues
- The JDBC Producer destination does not update the content of a column that was added by the PostgreSQL Metadata processor.
- The MongoDB Atlas destination does not write StreamSets Map and List fields as expected for update and upsert operations.
- The Oracle CDC origin does not properly handle null and empty values when converting from hex to target data types.
- The Oracle CDC origin provides negative values for some summary counters.
- The Oracle CDC origin does not correctly process time zones expressed as UTC Offset Standard Time, such as +05:00 or -07:00.
- The Oracle CDC origin considers a column to be a pseudocolumn if its name matches a documented pseudocolumn, even if the table definition includes the column.
- The Oracle CDC origin converts Oracle Date columns to the Data Collector Date data type instead of Datetime.
- The Oracle CDC Client origin reads data slower when using continuous mining due to changes in the caching algorithm.
- The Oracle CDC Client origin fails to parse LOB_WRITE, LOB_TRIM, and LOB_ERASE records that contain Blob or Clob fields when the Use PEG Parser property is enabled.
- The Salesforce stages import date values as strings. This fix has an upgrade impact.
5.5.x Known Issues
- The time:dateAddition and time:dateDifference functions fail to validate because they expect datetime values in the LocalDateTime date format, which the StreamSets expression language cannot interpret.
5.4.x Release Notes
The Data Collector 5.4.0 release occurred on February 28, 2023.
New Features and Enhancements
- Oracle CDC support
- You can use a new Oracle CDC origin to process change data from Oracle redo logs. Like the original Oracle CDC Client origin, the new Oracle CDC origin uses LogMiner to access online or archived redo logs.
- Stage enhancements
- Amazon stages - Amazon stages include updated regions in the AWS Region property.
- Azure Data Lake Storage Gen2 stages - You can configure the new Endpoint URL property when using OAuth with Service Principal authentication with the Azure Data Lake Storage Gen2 origin and destination, and the ADLS Gen2 File Metadata executor.
- JDBC Multitable Consumer origin - You can configure the new Maximum Number of Tables property to limit the number of tables to prefetch.
- Google BigQuery destination - The destination allows enabling the Create Table property only when it is configured to handle data drift. This reverts a change in 5.3.0 that allowed creating tables when the destination was not configured to handle data drift.
- JDBC Tee processor and JDBC Producer - You can use a new useLegacyZonedDatetime JDBC configuration property to help with MySQL driver upgrade issues. MySQL driver versions 8.0.23 and later return zoned datetimes in a different format. If you upgrade from an older MySQL driver to 8.0.23 or later, you can add useLegacyZonedDatetime as an Additional JDBC Configuration property and set it to true to have the stages provide zoned datetimes in the previous format.
- MongoDB Atlas stages - You can configure the UUID Interpretation Mode advanced property to specify how a MongoDB Atlas stage handles UUID fields.
- Orchestration stages - Orchestration stages that connect to Control Hub include new Max Number Of Tries and Retry Interval properties that determine how the stages try to connect to Control Hub after encountering communication errors.
- Salesforce stages - All Salesforce stages now use version 57.0.0 of the Salesforce API by default.
- Snowflake stages - The following enhancements apply to the Snowflake and Snowflake File Uploader destinations, and the Snowflake executor:
- Snowflake stages are no longer part of an Enterprise stage library. The stages are now available in the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib. This change has upgrade impact.
- When specifying an organization on the Snowflake Connection Info tab, you no longer need to specify a Snowflake region.
- Snowflake stages can access all of the latest regions for AWS, GCP, and Azure.
- Snowflake destination:
- The destination can automatically create tables when configured to handle data drift or use Snowpipe to load data. Previously, it did not create tables when using Snowpipe to load data.
- When you configure the Snowflake destination to use an Amazon S3 staging location, you no longer specify the S3 region. Data Collector now queries Snowflake for that information.
- Snowflake executor - You can use the new Warehouse property to define the warehouse to connect to.
- SQL Parser processor - You can configure the Parsing Thread Pool Size property to enable the processor to use multiple threads when processing data.
- Support
- PostgreSQL 15.x support - You can use the PostgreSQL CDC Client origin and JDBC stages with PostgreSQL 15.x.
- Red Hat Enterprise Linux support - You can install Data Collector on Red Hat Enterprise Linux version 9.x, in addition to 6.x - 8.x.
- New stage libraries:
- streamsets-datacollector-apache-kafka_3_3-lib - For Kafka version 3.3.x.
- streamsets-datacollector-sdc-snowflake-lib - For Snowflake.
- Additional enhancements
- New function - A new pipeline:email() function returns the email address of the user who started the pipeline.
- Data governance tools - Data governance tools can now publish metadata about the Kinesis Consumer origin.
- Java Security Manager - Data Collector can use the Java Security Manager only when using Java 8. Oracle has deprecated the Java Security Manager and marked it for removal. As a result, when using Java 9 or later, Data Collector cannot use the security manager.
Previous releases enabled the Java Security Manager by default for all Java versions, which in some cases caused known issues when using Java 9 or later. To avoid those issues, you had to disable the security manager by setting the SDC_SECURITY_MANAGER_ENABLED environment variable to false.
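As an illustration, the pipeline:email() function can be used anywhere the expression language is accepted, such as an Expression Evaluator output field. The output field name /started_by is hypothetical:

```
# Expression returning the email address of the user who started the pipeline
${pipeline:email()}

# For example, in an Expression Evaluator:
#   Output Field:     /started_by
#   Field Expression: ${pipeline:email()}
```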
Upgrade Impact
- Install the Snowflake stage library to use Snowflake stages and connections
- Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.
5.4.0 Fixed Issues
- When using JDBC stages with MySQL driver versions 8.0.16 and later, the stages can encounter data type conversion errors that cause an entire batch to be treated as error records instead of the individual records with the problem.
- The Oracle CDC Client origin can generate a null pointer exception when processing user-defined columns whose data type is ROWID. Oracle pseudocolumns are not affected by this issue.
- The Oracle CDC Client origin can generate a null pointer exception when the pipeline stops when the origin is processing error records.
- When the Oracle CDC Client origin scans redo logs to check for session integrity issues within a LogMiner window, current offset handling can cause it to lose data if missing changes appear in an unexpected order.
- The Oracle CDC Client origin generates Update records for Inserts that include Blob or Clob fields.
- Data Collector fails to start when the Java installation package version includes only the major version, without a specified minor version. For example, a version packaged as Java 11 can cause Data Collector to fail to start, but Data Collector starts as expected with a Java 11.5 package.
- The PostgreSQL CDC origin generates a null pointer exception when processing null values in numeric fields.
- The Snowflake destination does not perform case-sensitive evaluation of primary keys or properly honor the Upper Case Schema and Field Names property.
- The JDBC Multitable Consumer origin can generate a null pointer exception when using multiple threads.
- When upgrading from Data Collector 5.2.0 or earlier to 5.3.0, JDBC Multitable Consumer pipelines do not upgrade properly when the Number of Threads property is defined using an expression.
- The Cassandra destination includes two Write Timeout properties, instead of a Write Timeout property and a Socket Read Timeout property.
- Kafka message headers are not available when using Kafka Java client version 0.11, even though the library supports them.
5.4.x Known Issues
- The Oracle CDC origin converts Oracle Date columns to the Data Collector Date data type instead of Datetime.
Workaround: Use a Field Type Converter processor to convert the Date field to Datetime.
5.3.x Release Notes
The Data Collector 5.3.0 release occurred on December 2, 2022.
New Features and Enhancements
- Java 11 and 17 support
- With this release, Data Collector supports Java 11 and 17 in addition to Java 8. Due to third-party requirements, some Data Collector features require a particular Java version. For more information, see Java Versions and Available Features.
- Stage enhancements
-
- Amazon S3 executor - The executor supports server-side encryption.
- Dev Data Generator origin - This development origin can generate field attributes to enable easily testing field attribute functionality in pipelines.
- Directory and HDFS Standalone origins - The Directory and HDFS Standalone origins include a new Ignore Temporary Files property that enables the origins to skip processing Data Collector temporary files with a _tmp_ prefix.
- Field Flattener processor - The Output Type property allows you to choose the root field type for flattened records: Map or List-Map.
- Field Mapper processor - The Create New Paths property allows the processor to create new paths when changing record structures.
- Field Replacer processor - The Field Does Not Exist property
includes the following new options:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
These new options replace the Include without Processing option. Upgraded pipelines are set to Add New Field. This can have upgrade impact.
- Google BigQuery destination and executor - These stages are no
longer part of an Enterprise stage library. This includes the
following updates:
- The stages, previously known as the Google BigQuery (Enterprise) destination and the Google BigQuery (Enterprise) executor, are renamed to the Google BigQuery destination and Google BigQuery executor.
- The stages are now available in the Google
Cloud stage library,
streamsets-datacollector-google-cloud-lib
, and are available to install like any other Data Collector stage. This change has upgrade impact.
- Google BigQuery destination:
- The destination can write nested Avro data.
- The destination can use BigQuery to generate schemas.
- The destination now allows enabling the Create Table property when the destination is not configured to handle data drift.
- JDBC Multitable Consumer origin - You can no longer set the Minimum
Idle Connections property higher than the Number of Threads
property.
Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property. This can have upgrade impact.
- Kafka message header support - The following functionality is
available when using a Kafka Java client version 0.11 or later:
- The Kafka Multitopic Consumer origin includes Kafka message headers as record header attributes.
-
The Kafka Producer destination includes all user-defined record header attributes as Kafka message headers when writing to Kafka.
- MongoDB Atlas destination - The destination can update nested documents.
- OPC UA Client origin - The origin includes a new Override Host property which overrides the host name returned from the OPC UA server with the host name configured in the resource URL.
- Oracle CDC Client origin - The minimum values for the following advanced properties have changed from 1 millisecond to 0 milliseconds:
- Time between Session Windows
- Time after Session Window Start
- Start Job origin and processor - The Search Mode property enables the origin and processor to search for the Control Hub job to start.
- Connections
-
- Google BigQuery update - Due to the BigQuery change from enterprise,
you must install the Google Cloud stage library,
streamsets-datacollector-google-cloud-lib
, to configure or use Google BigQuery connections. This change has upgrade impact.
- Stage libraries
-
This release includes the following new stage libraries:
- streamsets-datacollector-mapr_7_0-lib - For MapR 7.0.x.
- streamsets-datacollector-mapr_7_0-mep8-lib - For MapR 7.0.x with MEP 8.x.
- Additional functionality
-
- Data governance tools - Data governance tools support
publishing metadata about the following additional stages:
- HTTP Client origin
- JDBC Multitable Consumer origin
- Advanced data format properties - Optional properties for data formats have become advanced options. You must now view advanced options to configure them.
- Delimited data format enhancements:
- When writing delimited data using a destination or the HTTP Client processor, you can define properties to enable writing multicharacter delimited data.
- When writing delimited data using a custom delimiter format, you can configure the Record Separator String property to define a custom record separator.
- Pipeline start events - Pipeline start event records include a new
email
field that contains the email address of the user who started the pipeline.
Upgrade Impact
- Install the Google Cloud stage library to use BigQuery stages and connections
- Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.
- Review minimum idle connections for JDBC Multitable Consumer origins
- Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.
- Review missing field behavior for Field Replacer processors
- Starting with version 5.3.0, the advanced Field Does Not Exist property in
the Field Replacer processor has the following two new options that replace
the Include without Processing option:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
- Review runtime:loadResource pipelines
- Starting with version 5.3.0, pipelines that include the
runtime:loadResource
function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.
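As a sketch of the affected call pattern, a stage property might load a resource file with an expression like the following, where the file name is illustrative and the second argument indicates whether the file must be restricted to owner-only permissions:

```
${runtime:loadResource("credentials.txt", true)}
```

With 5.3.0 behavior, this expression fails the pipeline with an error if credentials.txt is missing or empty, rather than failing silently.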
5.3.0 Fixed Issues
-
To avoid the Text4Shell vulnerability, Data Collector 5.3.0 is packaged with version 1.10.0 of the Apache Commons Text library.
-
When used in some locations, the runtime:loadResource function can silently fail and stop the pipeline when trying to load an empty or missing resource file, giving no indication of the problem. With this fix, when failing to load an empty or missing resource file, the runtime:loadResource function generates an error that stops the pipeline. This fix has upgrade impact.
-
Pipelines cannot be deleted when Data Collector uses Network File System (NFS).
-
When a query fails to produce results, the Elasticsearch origin stops when the network socket times out. With this fix, the origin continues retrying the query until the cursor expires.
-
The Email executor does not work when using certain providers, such as SMTP providers, due to a change in the version of a file.
-
The SQL Server CDC Client and SQL Server Change Tracking origins fail to function properly in a Control Hub fragment.
- When the Oracle CDC Client origin is not configured to process Blob or Clob columns, the origin includes Blob and Clob field names in the record with either null values or raw string values depending on whether the Unsupported Fields to Records property is enabled.
- The
java.security.networkaddress.cache.ttl
Data Collector configuration property does not cache Domain Name Service (DNS) lookups as expected.
5.3.x Known Issues
- Data Collector fails to start when the Java installation package version includes only the
major version, without a specified minor version. For example, a version
packaged as Java 11 can cause Data Collector to fail to start, but Data Collector starts as expected with a Java 11.5 package.
Workaround: Upgrade to Data Collector 5.4.0 or later, where this issue is fixed. Or, use a Java installation package with a minor version.
You can check the Java installation package version installed on a machine by running the following command:
java --version
- When upgrading from Data Collector 5.2.0 or earlier to 5.3.0, JDBC Multitable
Consumer pipelines do not upgrade properly when the Number of Threads property
is defined using an expression.
Workaround: Upgrade to Data Collector 5.4.0 or later, where this issue is fixed. Or, replace the expression in the Number of Threads property with a static value.
- Kafka message headers are not available when using Kafka Java client version 0.11, even though the library supports them.
5.2.x Release Notes
The Data Collector 5.2.0 release occurred on September 29, 2022.
New Features and Enhancements
- New stages
-
- MongoDB Atlas origin and destination - You can use the new MongoDB Atlas origin and destination to read from and write to MongoDB Atlas and MongoDB Enterprise Server.
- Stage enhancements
-
- Groovy stages - The Groovy Scripting origin and the Groovy Evaluator processor now support Groovy 4.0.
- JDBC Multitable Consumer origin - The origin now provides the jdbc.primaryKeySpecification record header attribute for records from tables with a primary key, and the jdbc.vendor record header attribute for all records.
- JDBC Tee processor and JDBC Producer destination - These stages can manage primary key value updates using the jdbc.primaryKey.before.columnName record header attribute for the old value and the jdbc.primaryKey.after.columnName record header attribute for the new value.
- MQTT stages - The MQTT Subscriber origin and the MQTT Publisher destination now support entering a list of brokers from high availability MQTT clusters without a load balancer.
- Oracle CDC Client origin:
- Conditional Blob and Clob support - When the
origin buffers changes locally, you can configure the origin
to process Blob and Clob data using the following advanced
properties:
- Enable Blob and Clob Columns Processing property - Enable this property to process Blob and Clob columns.
- Maximum LOB Size property - Optional property to define the maximum LOB size. When specified, overflow data is discarded.
- LogMiner Query Timeout property - This property defines how long the origin waits for a LogMiner query to complete.
- Time between Session Windows property - This advanced property sets the time to wait after a LogMiner session has been completely ingested. This ensures a minimum LogMiner window size.
- Time after Session Window Start property - This advanced property sets the time to wait after a LogMiner session starts. This allows Oracle to finish setting up before processing begins.
- Pipeline Finisher executor - The executor includes new React to Events and Event Type properties that enable the executor to stop a pipeline only upon receiving the specified event record type.
For example, you can now configure the executor to stop the pipeline only after receiving a no-more-data event record, and to ignore all other records that it might receive. Previously, you might have used a precondition or a Filter processor to ensure that the executor received only no-more-data events.
- Snowflake stages - Snowflake stages have been updated to support all Snowflake regions.
- SQL Server CDC Client origin - The origin can be configured to combine the two update records that SQL Server creates for each update into a single record.
With this property enabled, the origin generates record header attributes about the primary key.
- SQL Server Change Tracking origin - The origin generates record header attributes about the primary key.
- Connections when registered with Control Hub
- When Data Collector version 5.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- Stage libraries
-
This release includes the following new stage libraries:
- streamsets-datacollector-groovy_4_0-lib
- streamsets-datacollector-mongodb_atlas-lib
- Additional functionality
-
- New Data Collector configuration property - To cache Domain Name Service (DNS) lookups, you can use the new networkaddress.cache.ttl property in the $SDC_DIST/etc/sdc-java-security.properties file. With this change, the java.security.networkaddress.cache.ttl Data Collector property has been deprecated.
- Help options - The Local Help option has been removed from the Help configuration option in Help > Settings. When you view Data Collector help, you will always view it on the StreamSets documentation website: https://docs.streamsets.com.
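For example, to cache successful DNS lookups for 60 seconds, you might add the following line to the security properties file; the TTL value is illustrative:

```
# $SDC_DIST/etc/sdc-java-security.properties
# Cache successful DNS lookups for 60 seconds (illustrative value)
networkaddress.cache.ttl=60
```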
Upgrade Impact
- Review MySQL Binary Log pipelines
With 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.
In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields are converted to Long, and Set fields are converted to Integer. In 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.
After you upgrade to 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
- Review Oracle CDC Client pipelines
With 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.
In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values in records.
Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values in records. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.
5.2.0 Fixed Issues
- The MySQL Binary Log origin converts Enum and Set fields to different field types based on how the binlog_row_metadata database property is set. This fix has upgrade impact.
-
The Include Deleted Records property in the Salesforce Lookup processor does not display.
-
The Salesforce Bulk API destination can encounter problems when generating error records.
- Pipeline parameters do not work properly with required list properties.
5.2.x Known Issues
- The java.security.networkaddress.cache.ttl Data Collector configuration property does not cache Domain Name Service (DNS) lookups as expected.
Workaround: Use the new networkaddress.cache.ttl property in the $SDC_DIST/etc/sdc-java-security.properties file.
- When the Oracle CDC Client origin is not configured to process Blob or Clob columns, the origin includes Blob and Clob field names in the record with either null values or raw string values depending on whether the Unsupported Fields to Records property is enabled. This issue has upgrade impact.
5.1.x Release Notes
The Data Collector 5.1.0 release occurred on July 28, 2022.
New Features and Enhancements
- New stage
-
- Pulsar Consumer origin - The new Pulsar Consumer origin
can use multiple threads to read from Pulsar. The origin supports
schema validation and Pulsar namespaces configured to enforce schema
validation. You can specify the schema used to determine
compatibility between the origin and a Pulsar topic. You can also
use JWT authentication with the new origin.
With this new origin, the existing Pulsar Consumer has been renamed Pulsar Consumer (Legacy). Use this new Pulsar origin for all new development.
- Stage enhancements
-
- Aurora
PostgreSQL CDC Client and PostgreSQL CDC
Client origins - Both origins can now generate a record
for each individual operation.
Previously, the origins could only generate a record for each transaction.
- Aurora
PostgreSQL CDC Client, PostgreSQL CDC
Client, and MySQL Binary
Log origins - These origins include the following new
record header attributes when a table includes a primary key:
jdbc.primaryKeySpecification
- Includes a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.jdbc.primaryKey.before.<primary key column>
- Includes the previous value for the specified primary key column.jdbc.primaryKey.after.<primary key column>
- Includes the new value for the specified primary key column.
- Kafka Multitopic Consumer origin and Kafka Producer destination - The stages include a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
- OPC UA Server origin - The origin now supports using a user name and password to authenticate with the OPC UA server, in addition to an anonymous log in.
- Oracle CDC Client origin:
- The origin now supports reading from Oracle 21c databases.
- The field order of generated records now matches the column order in database tables. Previously, the field order was not guaranteed.
- When you configure the origin to use local buffers and write to disk, you can specify an existing directory to use.
- A new Data Collector configuration property affects the origin and can have upgrade impact. For details, see Upgrade Impact.
- Pulsar Consumer (Legacy) origin:
- The origin, formerly named Pulsar Consumer, has been renamed
with this release.
This change has no upgrade impact. However, we recommend using the new Pulsar Consumer origin, which supports multithreaded processing, to read from Pulsar.
- You can specify the schema used to determine compatibility between the origin and a Pulsar topic.
- You can also use JWT authentication with the origin.
- The origin, formerly named Pulsar Consumer, has been renamed
with this release.
- Salesforce Bulk API 2.0 stages:
- All Salesforce Bulk API 2.0 stages include a new Salesforce Query Timeout property which defines the number of seconds that the stage waits for a response to a query.
- The Salesforce Bulk API 2.0 origin and Salesforce Bulk API 2.0 Lookup processor both include a new Maximum Query Columns property that limits the number of columns that can be returned by a query.
- Scripting stages - The Groovy, JavaScript, and Jython Evaluator origins and processors now generate metrics for script execution and locking details that you can view when monitoring the pipeline.
- Field Remover processor - The processor now includes the On Record Error property on the General tab.
- Field Type Converter processor - When converting a Date, Datetime,
or Time field, the Date Format property now offers explicit options
to specify that the field contains a Unix timestamp in milliseconds
or seconds.
If the field contains a Unix timestamp and you select an alternate date format, then the behavior is unchanged: the processor assumes the timestamps are in milliseconds.
- SQL Parser processor:
- The processor adds the fields from the SQL statement in the same order as the corresponding columns in the database tables.
- The processor now includes field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information for each field.
- Pulsar Producer destination - You can now use JWT authentication with the destination.
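The jdbc.primaryKey.before.<primary key column> and jdbc.primaryKey.after.<primary key column> record header attributes described in this section can be read downstream with the record:attribute expression function. A sketch, using an illustrative column name:

```
${record:attribute('jdbc.primaryKey.before.CUSTOMER_ID')}
```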
- Connection enhancements
-
- Kafka connection - The connection includes a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
- OPC UA connection - The connection now supports using a user name and password to authenticate with the OPC UA server in addition to an anonymous log in to the server.
- Pulsar connection - The connection can now use JWT authentication to connect to Pulsar.
- Snowflake connection - The new Connection Properties property enables you to specify additional connection properties for Snowflake connections.
- Enterprise stage libraries
- In August 2022, StreamSets released the following Enterprise stage libraries:
- Azure Synapse 1.3.0
- Databricks 1.7.0
- Snowflake 1.12.0
- Additional enhancements
-
- Data Collector Docker image - The Docker image for Data Collector 5.1.0, streamsets/datacollector:5.1.0, uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. This change can have upgrade impact.
- Microsoft JDBC Driver for SQL Server - Data Collector uses version 10.2.1 of the driver to connect to Microsoft SQL Server. Due to changes in the driver, this can have upgrade impact.
- Data Collector configuration properties - You can define the
following new configuration properties:
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL
- When this configuration property is set totrue
, Data Collector attempts to disable SSL for all JDBC connections. This property is commented out by default.stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize
- When this configuration property is set totrue
, Data Collector reports memory consumption when the Oracle CDC Client origin uses local buffers.This property is set to
false
by default. In previous releases, the origin reported this information by default, so this enhancement has upgrade impact.
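A sketch of how these configuration properties might appear in the Data Collector configuration file; both values are illustrative:

```
# $SDC_DIST/etc/sdc.properties
# Attempt to disable SSL for all JDBC connections (commented out by default)
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL=true
# Report memory consumption when the Oracle CDC Client origin uses local buffers
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize=true
```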
- Stage libraries
- This release includes the following new stage libraries:
- streamsets-datacollector-apache-kafka_3_0-lib - For Apache Kafka 3.0.
- streamsets-datacollector-apache-kafka_3_1-lib - For Apache Kafka 3.1.
- streamsets-datacollector-apache-kafka_3_2-lib - For Apache Kafka 3.2.
Upgrade Impact
- Review SQL Server pipelines without SSL/TLS encrypted connections
- With 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version has introduced a breaking backward-incompatible change.
- Review reporting requirements for Oracle CDC Client pipelines
- With 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.
- Review Dockerfiles for custom Docker images
- In previous releases, the Data Collector Docker image used Alpine Linux as a parent image. Due to limitations in Alpine Linux, with this release the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image.
5.1.0 Fixed Issues
- When the Oracle CDC Client origin is configured to use the PEG Parser for processing, the jdbc.primaryKey.before.<primary key column> and jdbc.primaryKey.after.<primary key column> record header attributes are not set correctly.
- If an Oracle RAC node is force stopped, the Oracle CDC Client origin can stop producing records even though it is still mining through a LogMiner session. Instead of retrying until the system stabilizes, the origin needs to create a new LogMiner session. This issue is related to the internal tasks that Oracle runs after restarting a crashed node.
- Data loss can occur when the Oracle CDC Client origin does not use local buffering and the pipeline is stopped while a transaction that contains multiple operations and spans several seconds is being processed.
- Data Collector can fail to load a PostgreSQL driver correctly when you have pipelines that use different stages to access PostgreSQL.
- The Maximum Parallel Requests property in the HTTP Client processor and
destination does not work as expected.
This property was removed from the processor and destination because these stages do not support parallel requests.
- The MySQL Binary Log origin can fail when the order of columns in a source table
changes.
This issue is fixed when using the origin from Data Collector version 5.1.0 or later to read from MySQL 8.0 or later. However, you must set the
binlog_row_metadata
MySQL configuration property toFULL
. - The MySQL Binary Log origin can stall and stop processing due to a problem with an internal queue. If you attempt to stop the pipeline at that time, the pipeline can become non-responsive.
- When registered with Control Hub, the SFTP/FTP/FTPS connection does not include private key properties that enable configuring the connection to use private key authentication.
5.1.x Known Issues
There are no important known issues at this time.
5.0.x Release Notes
The Data Collector 5.0.0 release occurred on April 29, 2022.
New Features and Enhancements
- New stages
-
- Aurora PostgreSQL CDC Client origin - Use the origin to process Write-Ahead Logging (WAL) data to generate change data capture records for an Amazon Aurora PostgreSQL database.
- Salesforce Bulk API 2.0 origin - Reads from Salesforce using Salesforce Bulk API 2.0.
- Salesforce Bulk API 2.0 Lookup processor - Performs lookups on Salesforce data using Salesforce Bulk API 2.0.
- Salesforce Bulk API 2.0 destination - Writes to Salesforce using Salesforce Bulk API 2.0.
- Updated stages
-
- HTTP Client
enhancements - When using OAuth2 authentication with HTTP
Client stages, you can configure the following new properties:
- Use Custom Assertion and Assertion Key Type - Use these properties to specify a custom parameter for passing the JSON Web Token (JWT).
- JWT Headers - Use to specify headers to include in the JWT.
- JMS Producer destination - You can configure the destination
to remove the
jms.header
prefix from record header attribute names before including the information as headers in the JMS messages. - Kafka
Multitopic Consumer origin - The origin includes the
following new properties:
- Topic Subscription Type and Topic Pattern - Use these two properties to specify a regular expression that defines the topic names to read from, instead of simply listing the topic names.
- Metadata Refresh Time - Specify the milliseconds to wait before checking for additional topics that match the regular expression.
- Oracle CDC Client
origin - When a table includes a primary key, the origin
includes the following new record header attributes:
jdbc.primaryKeySpecification
- Includes a JSON-formatted string with all primary keys in the table and related metadata. For example: jdbc.primaryKeySpecification = {"<primary key name>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true}, "<primary key name 2>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true}}
jdbc.primaryKey.before.<primary key column>
- Includes the previous value for the specified primary key column.jdbc.primaryKey.after.<primary key column>
- Includes the new value for the specified primary key column.
- Pulsar Producer destination - Use the new Schema tab to specify the schema that Pulsar uses to validate the messages that the destination writes to a topic.
- Salesforce stages - All Salesforce stages now use version 54.0 of the Salesforce API by default.
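Because jdbc.primaryKeySpecification holds a JSON-formatted string, a downstream consumer can parse it with any JSON library. A minimal sketch in Python, using the example value from the release note above with quotes normalized and an illustrative key name:

```python
import json

# Example jdbc.primaryKeySpecification value, with typographic quotes
# normalized to straight quotes and an illustrative primary key name.
spec = ('{"ORDER_ID": {"type": 2, "datatype": "VARCHAR", "size": 39, '
        '"precision": 0, "scale": -127, "signed": true, "currency": true}}')

# Parse the attribute value and list each primary key column with its metadata.
primary_keys = json.loads(spec)
for column, meta in primary_keys.items():
    print(column, meta["datatype"], meta["size"])  # ORDER_ID VARCHAR 39
```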
- Connections when registered with Control Hub
-
- When Data Collector version 5.0.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- Aurora PostgreSQL CDC Client origin
-
Azure Synapse SQL destination
- Google BigQuery executor
- Hive stages
- OPC UA Client origin
- Salesforce Bulk API 2.0 stages
- Enterprise stage libraries
- In May 2022, StreamSets released the following Enterprise stage libraries:
- Azure Synapse 1.2.0
- Databricks 1.6.0
- Google 1.1.0
- Oracle 1.4.0
- Snowflake 1.11.0
- Additional enhancements
-
- Data Collector logs - Data Collector uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Data Collector used the Apache Log4j 1.x library which is now end-of-life. This can have upgrade impact.
- Elasticsearch 8.0 support - You can now use Elasticsearch stages to read from and write to Elasticsearch 8.0.
- Credential stores property - A new credentialStores.usePortableGroups credential stores property enables migrating pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.
Upgrade Impact
- Data Collector log configuration
- With 5.0.0 and later, Data Collector uses the Apache Log4j 2.17.2 library to write log data. Data Collector includes the following log configuration files:
- sdc-log4j2.properties
- log4j2.component.properties
- Update Oracle CDC Client origin user accounts
- With 5.0.0 and later, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.
5.0.0 Fixed Issues
- The Oracle CDC Client origin can fail if redo logs are rotated as the origin
reads data from the current log. The origin can also fail when an Oracle RAC
node fails or recovers from a failure or planned shut down.
With this fix, the Oracle CDC Client origin handles these and other recovery and maintenance scenarios more reliably and efficiently. However, the fix requires configuring additional permissions for the Oracle user. For more information, see Upgrade Impact.
- The Oracle CDC Client origin treats the underscore character ( _ ) as a single-character wildcard in schema names and table name patterns, disallowing the valid use of the character as a literal underscore.
  With this fix, you can use the character as a literal underscore by escaping it with a slash character ( / ). For example, to specify the NA_SALES table, you enter NA/_SALES.
- Oracle CDC Client origin pipelines fail with null pointer exceptions when the origin is configured to buffer data locally to disk instead of in memory.
- When the Oracle CDC Client origin Convert Timestamp to String advanced property is enabled, the origin does not properly handle unparsable timestamps.
- JDBC stages that read data, such as the JDBC Query Consumer origin or the JDBC Lookup processor, do not generate records after one of the JDBC stages encounters an error reading a table column.
- The JDBC Query Consumer origin incorrectly generates a no-more-data event when the limit in a query matches the configured max batch size.
- The JDBC Query Consumer origin is unable to read Oracle data of the Timestamp with Local Time Zone data type.
5.0.x Known Issues
There are no important known issues at this time.
4.4.x Release Notes
- 4.4.1 on March 24, 2022
- 4.4.0 on February 16, 2022
New Features and Enhancements
- Updated stages
-
- Amazon S3 stages - You can use an Amazon S3 stage to connect to Amazon S3 using a custom endpoint.
- Amazon S3 destination - You can configure the destination to add tags to the Amazon S3 objects that it creates.
- Base 64 Field Decoder and Encoder processors - You can configure the processors to decode or encode multiple fields.
- Google BigQuery (Legacy) destination - The destination, formerly called Google BigQuery, has been renamed and deprecated with this release. The destination may be removed in a future release. We recommend that you use the Google BigQuery destination to write data to Google BigQuery, which supports processing CDC data and handling data drift.
- Hive Query executor - You can use time functions in the SQL queries that execute on Hive or Impala. When using time functions, you can also select the time zone that the executor uses to evaluate the functions.
- HTTP Client stages - You can configure additional security headers to include in the HTTP requests made by the stage. Use additional security headers when you want to include sensitive information, such as user names or passwords, in an HTTP header. For example, you might use the credential:get() function in an additional security header to retrieve a password stored securely in a credential store.
- HTTP Client processor - You can configure the processor to send a single request that contains all records in the batch.
- JMS Producer destination - You can configure the destination to include record header attributes with a jms.header prefix as JMS message headers.
- Pulsar stages - You can configure a Pulsar stage to use OAuth 2.0 authentication to connect to an Apache Pulsar cluster.
- Pulsar Consumer (Legacy) origin - The origin creates a pulsar.topic record header attribute that includes the topic that the message was read from.
- Salesforce stages - Salesforce stages now use version 53.1.0 of the Salesforce API by default.
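The credential:get() function mentioned for the HTTP Client additional security headers uses the standard Data Collector credential expression syntax. A sketch of a header value, where the header name, store ID, group, and secret name are all placeholders:

```
X-App-Token: ${credential:get("jks", "all", "http-api-token")}
```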
- Connections when registered with Control Hub
-
- When Data Collector version 4.4.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- CoAP Client destination
- Influx DB destination
- Influx DB 2.x destination
- Pulsar stages
- Amazon S3 enhancement - The Amazon S3 connection supports connecting to Amazon S3 using a custom endpoint.
- Credential stores
-
- Google Secret Manager - You can configure Data Collector to authenticate with Google Secret Manager using credentials in a Google Cloud service account credentials JSON file.
- Enterprise stage library
-
In February 2022, StreamSets released an updated Snowflake Enterprise stage library.
- Data Collector Edge
- You can no longer download the Data Collector Edge executable from Data Collector. You can download the Data Collector Edge executable from the StreamSets Support portal.
Upgrade Impact
- Encryption JAR file removed from Couchbase stage library
- With Data Collector 4.4.0 and later, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
4.4.1 Fixed Issues
- In Data Collector version 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
- When a Kubernetes pod that contains Data Collector shuts down while a pipeline that includes a MapR FS File Metadata or HDFS File Metadata executor is running, the executor cannot always perform the configured tasks.
- Access to Control Hub through the Data Collector user interface times out.
Though this fix may have resolved the issue, as a best practice, use Control Hub to author pipelines instead of Data Collector.
4.4.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.17.0 and earlier 2.x versions, Data Collector 4.4.0 is packaged with Log4j 2.17.1. This is the latest available Log4j version, and contains fixes for all known issues.
- The Oracle CDC Client origin does not correctly handle a daylight saving time change when configured to use a database time zone that uses daylight saving time.
- The MapR DB CDC origin does not properly handle records with null values.
- The Kafka Multitopic Consumer origin does not respect the configured Max Batch Wait Time.
- A state notification webhook always uses the POST request method, even if configured to use a different request method.
- When the HTTP Client origin uses OAuth authentication and the request returns 401 Unauthorized and 403 Forbidden statuses, the origin generates a new OAuth token indefinitely.
- The MapR DB CDC origin incorrectly updates the offset during pipeline preview.
- When Amazon stages are configured to assume another role and configured to connect to an endpoint, the stages do not redirect to the correct URL.
- JDBC origins encounter an exception when reading data with an incorrect date format, instead of processing the record as an error record.
- The Directory origin skips reading files that have the same timestamp.
- The JDBC Multitable Consumer origin cannot use a wildcard character (%) in the Schema property.
- The Azure Data Lake Storage Gen2 and Local FS destinations do not correctly shut down threads.
4.4.x Known Issue
- In Data Collector 4.4.0, the HTTP Client processor cannot write HTTP response data to an
existing field. Earlier Data Collector versions are not affected by this issue.
Workaround: If using Data Collector 4.4.0, upgrade to Data Collector 4.4.1, where this issue is fixed.
4.3.x Release Notes
The Data Collector 4.3.0 release occurred on January 13, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming Control Hub feature in the StreamSets platform.
4.3.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.3.0 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
4.3.x Known Issues
There are no important known issues at this time.
4.2.x Release Notes
- 4.2.1 on December 23, 2021
- 4.2.0 on November 9, 2021
New Features and Enhancements
- New support
-
- Red Hat Enterprise Linux 8.x - Data Collector now supports installation on RHEL 8.x, in addition to 6.x and 7.x.
- New stage
-
- InfluxDB 2.x destination - Use the destination to write to InfluxDB 2.x databases.
- Updated stages
-
- Couchbase Lookup processor property name updates - For clarity, the
following property names have been changed:
- Property Name is now Sub-Document Path.
- Return Properties is now Return Sub-Documents.
- SDC Field is now Output Field.
- When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
- When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
- Einstein Analytics destination enhancements:
- The Einstein Analytics destination has been renamed the Tableau CRM destination to match the Salesforce rebranding.
- The new Tableau CRM destination can perform automatic recovery.
- HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
- PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
- Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
- Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
- SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and Socks proxies.
- Connections when registered with Control Hub
-
- When Data Collector version 4.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stage supports using Control Hub connections:
- Cassandra destination
- SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
- Additional enhancements
-
- Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, $SDC_RESOURCES, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
- Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user's group against a comma-separated list of groups allowed to access each secret.
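Assuming the standard HTTPS properties in the Data Collector configuration file, a relative keystore path might then be configured as follows; treat the property names and file names as a sketch:

```properties
# keystore.jks stored in $SDC_RESOURCES; the path is relative to that directory
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
```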
- Testing update
- With this release, StreamSets no longer tests Data Collector against Cloudera CDH 5.x, which has been deprecated.
Upgrade Impact
- Enabling HTTPS for Data Collector
- With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, $SDC_RESOURCES. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration file.
- Tableau CRM destination write behavior change
- The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.
4.2.1 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
- The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as 4.1.x.
4.2.0 Fixed Issues
- Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
- The Oracle CDC Client origin can miss recovering transactional data when the pipeline unexpectedly stops when the origin is processing overlapping transactions.
- The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
- The MongoDB destination cannot write null values to MongoDB.
- The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
- Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
- The MapR DB CDC origin does not properly handle records with deleted fields.
- When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
- The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.
4.2.x Known Issues
There are no important known issues at this time.
4.1.x Release Notes
The Data Collector 4.1.0 release occurred on August 18, 2021.
New Features and Enhancements
- Use the StreamSets platform to access Data Collector
- Existing customers can continue to access Data Collector downloads using the StreamSets Support Portal.
- New stage
-
- Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
- Stage type enhancements
-
- Amazon stages - When you configure the Region property, you can select from several additional regions.
- Kudu stages - The default value for the Maximum Number
of Worker Threads property is now 2. Previously, the default was 0,
which used the Kudu default.
Existing pipelines are not affected by this change.
- Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
- Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
- Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
- Origin enhancements
-
- Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
- MySQL Binary Log origin - The origin now recovers automatically from
the following issues:
- Lost, damaged, or unestablished connections.
- Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
- Oracle CDC Client origin:
- The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
- The origin provides additional LogMiner metrics when you monitor a pipeline.
- RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
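The quorum-queue setting described above is a single name-value pair in the origin's Additional Client Configuration properties; shown here in properties form for illustration:

```properties
# Additional Client Configuration entry that selects RabbitMQ quorum queues
x-queue-type=quorum
```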
- Processor enhancements
-
- SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
- Destination enhancements
-
- Google BigQuery (Legacy) destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
- MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
- Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include when writing data to Splunk.
- Credential stores
-
- New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
- CyberArk enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach the CyberArk services.
- Enterprise stage libraries
- In October 2021, StreamSets released the following new Enterprise stage
library:
- Connections when registered with Control Hub
-
- When Data Collector version 4.1.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- MongoDB stages
- RabbitMQ stages
- Redis stages
- Snowflake enhancement - The Snowflake connection includes the following role properties:
- Use Snowflake Role
- Snowflake Role Name
- Stage libraries
- This release includes the following new stage library:
- streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
- Additional enhancements
-
- Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.
4.1.0 Fixed Issues
- Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
- Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages don't release them appropriately.
- Data Collector does not properly handle Avro schemas with nested Union fields.
- Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
- When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
- The JDBC Lookup processor does not support expressions for table names when validating column mappings.
  Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
- The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
- When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.
4.1.x Known Issues
There are no important known issues at this time.
4.0.x Release Notes
- 4.0.2 - June 23, 2021
- 4.0.1 - June 7, 2021
- 4.0.0 - May 25, 2021
New Features and Enhancements
- Stage enhancements
-
- Control Hub orchestration stages - Orchestration stages use API credentials to connect to Control Hub in the StreamSets platform. This affects the following stages:
- Kafka stages - Kafka stages include an Override Stage
Configurations property that enables custom Kafka properties defined
in the stage to override other stage properties.
This can impact existing pipelines.
- MapR Streams stages - MapR Streams stages also include an Override
Stage Configurations property that enables the additional MapR or
Kafka properties defined in the stage to override other stage
properties.
This can impact existing pipelines.
- Salesforce stages - The Salesforce origin, processor, destination,
and the Tableau CRM destination include the following new timeout
properties:
- Connection Handshake Timeout
- Subscribe Timeout
- Oracle CDC Client origin:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
- Starting with version 4.0.1, the origin includes a Batch Wait Time property.
- Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
- HTTP Client processor:
- Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
- The following record header attributes are populated when you
use one of the Pass Records properties:
- httpClientError
- httpClientStatus
- httpClientLastAction
- httpClientTimeoutType
- httpClientRetries
- SQL Parser processor:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
- Connections when registered with Control Hub
- When Data Collector version 4.0.0 is registered with Control Hub cloud or with Control
Hub on-premises version 3.19.x or later, the following stages support
using Control Hub
connections:
- Oracle CDC Client origin
- SQL Server CDC Client origin
- SQL Server Change Tracking Client origin
- Enterprise stage libraries
- In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
- Additional features
-
- SDC_EXTERNAL_RESOURCES environment variable - An optional root
directory for external resources, such as custom stage libraries,
external libraries, and runtime resources.
The default location is $SDC_DIST/externalResources.
- Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
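For a manually started tarball installation, the variable can be overridden in the environment before starting Data Collector; the path below is a hypothetical example:

```shell
# Hypothetical override; when unset, SDC_EXTERNAL_RESOURCES evaluates to
# $SDC_DIST/externalResources.
export SDC_EXTERNAL_RESOURCES=/opt/sdc/external-resources
echo "$SDC_EXTERNAL_RESOURCES"
```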
- Deprecated features
- Several features and stages have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these features and stages. For a full list, click here.
Upgrade Impact
- Conflicting properties in Kafka and MapR Streams stages
- In previous releases, if you specify an additional configuration property that conflicts with a stage property setting in a Kafka or MapR Streams stage, the stage property takes precedence.
- Control Hub On-premises prerequisite task
- Before using Data Collector 4.0.0 or later versions with Control Hub On-premises, you must complete a prerequisite task. For details, see the StreamSets Support portal.
- HTTP Client processor batch wait time change
- With this release, the HTTP Client processor performs additional checks against the specified batch wait time. This can affect existing pipelines. For details, see Review HTTP Client Processor Pipelines.
- Open source status
- Data Collector 4.0.0 and later versions are not open source. This means that StreamSets will not make the source code publicly available.
- Stages removed
-
The following stages have been deprecated for several years and have been removed from Data Collector with this release:
- HTTP to Kafka origin
- SDC RPC to Kafka origin
- UDP to Kafka origin
- Updated environment variable default (tarball installation, manual start)
- For manually-started tarball installations, the default location for the SDC_RESOURCES environment variable has changed from $SDC_DIST/resources to $SDC_EXTERNAL_RESOURCES/resources, which evaluates to: $SDC_DIST/externalResources/resources.
4.0.2 Fixed Issues
- The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
- You cannot use API user credentials in Orchestration stages.
4.0.1 Fixed Issue
- In the JDBC Lookup
processor, enabling the Validate Column Mappings property when using an
expression to represent the lookup table generates an invalid SQL
query.
Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.
4.0.0 Fixed Issues
- The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
- The Hadoop FS destination fails to recover temporary files when the directory template includes pipeline parameters or expressions.
- The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed after exceeding the maximum transaction length.
- The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
- Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
- HTTP Client stages should not log the proxy password when the Data Collector logging mode is set to Debug.
- The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
- The MQTT Subscriber origin does not properly restore a persistent session.
- The Oracle CDC Client origin generates an exception when Oracle unexpectedly includes an empty string in a redo log statement. With this fix, the origin interprets empty strings as NULL.
- Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.
4.0.x Known Issues
There are no important known issues at this time.