Release Notes
5.6.x Release Notes
- 5.6.2 on September 14, 2023
- 5.6.1 on August 22, 2023
- 5.6.0 on June 26, 2023
New Features and Enhancements
- New stages
- Aerospike Client destination - The new destination writes data to Aerospike.
- Kaitai Struct Parser processor - The new processor parses binary data using a Kaitai Struct format description.
- Snowflake Bulk origin - The new origin reads the available data from multiple Snowflake tables or views and then stops the pipeline. The origin can use multiple threads to perform parallel processing.
- Stage enhancements
- Amazon S3, Azure Blob Storage, Directory, and Google Cloud Storage origins - These origins now support the binary data type.
- Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Hadoop FS, Local FS, and MapR FS destinations - In the Compression Codec property, the Snappy option has been changed to Snappy (Airlift) to indicate that the destination uses the Airlift version of Snappy rather than the standard Snappy version.
- Azure Synapse SQL destination - On the Azure Synapse SQL tab, you can define property names and values in Additional Connection Properties to specify standard driver and Hikari properties.
- Databricks Delta Lake destination and Databricks Query executor - These stages are no longer part of an Enterprise stage library. These stages are now available in the Databricks stage library, streamsets-datacollector-sdc-databricks-lib. The Databricks stage library requires the scheme jdbc:databricks rather than jdbc:spark in the URL or connection string. This change impacts upgrades.
- Databricks Delta Lake destination:
- When processing CDC data, the destination can now use the primary key information from record header attributes. In the new Primary Key Location property, you configure where the destination finds the primary key, either in the header attributes or in the stage configuration for each table.
- The Key Columns property has been renamed Table Key Columns. The property is available if you set the Primary Key Location property to Specify for each table.
- With relaxed requirements for table names, the destination supports writing data to Delta Lake tables managed by Unity Catalog.
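The jdbc:databricks scheme change noted above for the Databricks stage library means an upgraded connection string needs only a scheme swap. A hedged illustration, in which the hostname, port, and HTTP path are hypothetical placeholders and the remaining parameters are only representative:

```
# Before (Databricks Enterprise stage library):
jdbc:spark://<server-hostname>:443/default;transportMode=http;httpPath=<http-path>

# After (Databricks stage library, 5.6.0 and later):
jdbc:databricks://<server-hostname>:443/default;transportMode=http;httpPath=<http-path>
```

Only the scheme changes; the connection parameters that follow it carry over unchanged.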
- HTTP Client processor:
- Records that are not processed before the batch wait time expires are sent for error processing rather than discarded. If One Request per Batch is enabled and the records are not processed before the batch wait time expires, all the records in the batch are sent for error processing.
- In the new Compression Library property, you can specify the compression library used to decompress files before reading. By default, the processor independently detects the compression library for each file. For the processor to read files compressed with the Airlift version of Snappy, you must select Snappy (Airlift Snappy) in the Compression Library property. For the processor to read files compressed with the standard version of Snappy, select the default option to detect the compression library automatically.
- MongoDB Atlas origin - The Initial Offset property no longer has a default value.
- Oracle Bulkload origin - The origin is no longer part of an Enterprise stage library. The origin is included in the JDBC Oracle stage library, streamsets-datacollector-jdbc-oracle-lib. This change impacts upgrades.
- Oracle CDC and Oracle CDC Client origins - The origins contain a new property, Fetch Strategy, that sets the method for staging LogMiner results. Staging to a disk-based queue can alleviate memory issues.
- Orchestration stages - You can configure the Start Jobs origin, Control Hub API processor, Start Jobs processor, or Wait for Jobs processor to use a Control Hub connection.
- Origins with a Compression Format property - For compressed formats, you can specify the compression library that the origin uses to decompress files in the new Compression Library property. By default, the origin independently detects the compression library for each file. For origins to read files compressed with the Airlift version of Snappy, including files from destinations, you must select Snappy (Airlift Snappy) in the Compression Library property. For origins to read files compressed with the standard version of Snappy, select the default option to detect the compression library automatically. This change impacts upgrades.
- Snowflake stages - All Snowflake stages, including the Snowflake destination, the Snowflake File Uploader destination, and the Snowflake executor, have improved connection and authentication options:
- To connect to a virtual private Snowflake installation, you have two options. You can configure the stages to compute the virtual private URL automatically from the values in the Account property and either the Snowflake Region or Organization property. Alternatively, you can enter a custom JDBC URL.
- For authentication, the stages support OAuth and key pairs as alternatives to user credentials. To use other authentication methods, you can enter custom connection properties.
- Snowflake destination:
- The destination supports arrays. On the Data Advanced tab, you can use the ARRAY Default property to configure the default value for missing or incorrect array values.
- For Amazon S3 external stages, you can configure the S3 Tags property to specify a list of tags to add to created objects.
- On the Snowflake tab, the Error Behavior, Skip File On Error, Max Error Records, and Max Error Record Percentage properties have been moved to the bottom of the tab.
- SQL Parser processor - For tables with a primary key, the processor includes the following new record header attributes to track changes in the primary key:
- jdbc.primaryKeySpecification - Includes a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.
- jdbc.primaryKey.before.<primary key column> - Includes the previous value for the specified primary key column.
- jdbc.primaryKey.after.<primary key column> - Includes the new value for the specified primary key column.
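As an illustration of the attributes above, an update that changes a primary key value might carry header attributes such as the following. The table column name CUSTOMER_ID and the values are hypothetical, and the contents of the specification JSON are elided:

```
jdbc.primaryKeySpecification = {"CUSTOMER_ID": {...}}
jdbc.primaryKey.before.CUSTOMER_ID = 1001
jdbc.primaryKey.after.CUSTOMER_ID = 2001
```

Downstream stages can compare the before and after attributes to detect primary key changes.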
- SQL Server CDC Client origin - The Combine Update Records property has been replaced by the Record Format attribute. You can choose between three options to configure how the origin generates records:
- Basic - Generates two records for updates, one with the old data and one with the changed data. This option produces the same result as if Combine Update Records were set to false.
- Basic discarding ‘Before Update’ records - Generates one record for updates, containing the changed data.
- Rich - Generates one record for updates with data written to the Data field, OldData field, or both. This option produces the same result as if Combine Update Records were set to true.
The origin generates a record header attribute named record_format, which indicates the format of the generated record: 1 indicates basic format, 2 indicates basic discarding "before update" records, and 3 indicates rich.
- Stage libraries - This release includes the following new stage libraries:
- streamsets-datacollector-aerospike-client-lib - For Aerospike 6.x.
- streamsets-datacollector-apache-kafka_3_4-lib - For Kafka version 3.4.x.
- streamsets-datacollector-cdp_7_1_8-lib - For Cloudera CDP 7.1.8.
- streamsets-datacollector-kaitai-lib - For Kaitai Struct.
- streamsets-datacollector-sdc-databricks-lib - For Databricks.
- Connections
- Aerospike connection - You can use this new Control Hub connection with the Aerospike Client destination.
- Databricks Delta Lake connection - The connection requires the Databricks stage library, streamsets-datacollector-sdc-databricks-lib. The Databricks stage library requires the scheme jdbc:databricks rather than jdbc:spark in the URL. This change impacts upgrades.
- Orchestrator connection - You can use this new Control Hub connection with the Start Jobs origin, Control Hub API processor, Start Jobs processor, or Wait for Jobs processor.
- Snowflake connection:
- You can use the Snowflake connection with the new Snowflake Bulk origin.
- The connection has improved connection and authentication options, described above for the Snowflake stages under “Stage enhancements.”
- Tests of the Snowflake connection properly use every configuration set under Connection Properties.
- Additional enhancements
- Microsoft Azure Key Vault credential store - Data Collector can use managed identities in addition to client keys to authenticate with Azure Key Vault.
- The sdc.properties file contains a new property. When the property is enabled, Data Collector provides an HTTP Strict-Transport-Security (HSTS) response header. To enable the new property, you must configure the https.port property.
- Time functions - The StreamSets expression language includes the following new time functions:
- time:nowNumber - Creates a Date object set to the current date and time with millisecond precision.
- time:nowNanoInstant - Creates a LocalDateTime object set to the current date and time with nanosecond precision.
- time:nowNanoZonedInstant - Creates a ZonedDateTime object set to the current date and time with nanosecond precision.
- time:nowNanoTimestampNumber - Creates a Number object set to the current time, specified as epoch time, with nanosecond precision.
- time:nowNanoTimestampString - Creates a String object set to the current time, specified as epoch time, with nanosecond precision.
- Timestamp functions - The StreamSets expression language introduces a new type of function that offers improved performance over time functions. You can replace a time function with the corresponding timestamp function for better performance. StreamSets recommends using timestamp functions for all new development. The expression language includes the following timestamp functions:
- timestamp:nowDate - Creates a Date object set to the current date and time with millisecond precision. Use as an alternative to the time:now function.
- timestamp:nowLocal - Creates a LocalDateTime object set to the current date and time with nanosecond precision. Use as an alternative to the time:now function.
- timestamp:nowMilliseconds - Creates a Long object set to the current date and time with millisecond precision. Use as an alternative to the time:nowNumber function.
- timestamp:nowMillisecondsString - Creates a String object set to the current date and time with millisecond precision. Use as an alternative to the time:nowNanoTimestampString function.
- timestamp:nowNanoseconds - Creates a Double object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoTimestampNumber function.
- timestamp:nowNanosecondsString - Creates a String object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoTimestampString function.
- timestamp:nowZoned - Creates a ZonedDateTime object set to the current date and time with nanosecond precision. Use as an alternative to the time:nowNanoZonedInstant function.
- timestamp:extractStringFromDate - Converts a Date object into a String object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractStringFromDate function.
- timestamp:extractStringFromDateAndZone - Converts a Date object into a String object, based on a specified date-time format and a time zone, with nanosecond precision. Use as an alternative to the time:extractStringFromDateTZ function.
- timestamp:extractLongFromDate - Converts a Date object into a Long object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractLongFromDate function.
- timestamp:createDateFromString - Converts a String object into a Date object, based on a specified date-time format, with nanosecond precision. Use as an alternative to the time:extractDateFromString function.
- timestamp:createDateFromStringAndZone - Converts a String object into a Date object, based on a specified date-time format and a time zone, with nanosecond precision. Use as an alternative to the time:createDateFromStringTZ function.
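For example, replacing a time function with its timestamp counterpart is a direct substitution in an expression. In this sketch, the field name /created and the format pattern are hypothetical, and the extract function is assumed to take the same argument order as the time function it replaces:

```
# Before: millisecond-precision epoch time
${time:nowNumber()}

# After: the recommended timestamp equivalent
${timestamp:nowMilliseconds()}

# Formatting a Date field as a string
${timestamp:extractStringFromDate(record:value('/created'), 'yyyy-MM-dd HH:mm:ss')}
```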
- Data lineage publication - For each job run, Data Collector can publish data lineage information in a JSON file. You can export and use the file in your data governance solution. To enable data lineage publication, you must add the following lines to the sdc.properties file:
lineage.publishers=json
lineage.publisher.json.def=streamsets-datacollector-basic-lib::com_streamsets_pipeline_lib_lineage_JSONLineagePublisher
lineage.publisher.json.config.outputDir=<directory for files>
You can optionally include the following line:
lineage.publisher.json.config.saveInterval=<time in milliseconds>
Data Collector publishes the data lineage information at the specified interval. By default, the interval is set to 60,000 milliseconds. If you specify a negative interval, Data Collector only publishes the information when the pipeline finishes.
Data Collector publishes data lineage information for the following stages:
- Databricks Job Launcher executor
- Dev Data Generator origin
- Google Big Query origin and destination
- Google Cloud Storage origin and destination
- Google Pub Sub Publisher destination
- HBase destination
- HTTP Client origin
- JDBC Multitable Consumer origin
- JDBC Query Consumer origin
- Kinesis Consumer origin
- Kinesis Producer destination
The following stages include field-level data:
- JDBC Multitable Consumer origin
- JDBC Query Consumer origin
- Dev Data Generator origin
- HBase destination
Upgrade Impact
- Review bucket properties in Amazon S3 stages
- Starting with version 5.6.0, you can no longer include the forward slash (/) in the following properties for Amazon S3 stages due to an Amazon Web Services SDK upgrade:
- Bucket property for the Amazon S3 origin
- Bucket and path property for the Amazon S3 destination and executor
- Install the Databricks stage library to use the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection
- Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.
- Review the JDBC URL in the Databricks Delta Lake destination and the JDBC Connection String in the Databricks Query executor
- Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.
- Update the JDBC URL in the Databricks Delta Lake connection
- Starting with version 5.6.0, the scheme of the URL is jdbc:databricks rather than jdbc:spark.
- Install the Oracle stage library to use the Oracle Bulkload origin
- Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.
- Grant users view access for the Oracle CDC origin
- Starting with version 5.6.0, the Oracle CDC origin must use a user account with access to the v$containers view.
- Update origins and processors that read files compressed with the Airlift version of Snappy
- Starting with version 5.6.0, origins that read compressed files require you to set the Compression Library property to properly read files compressed with the Airlift version of Snappy. Destinations compress files with the Airlift version of Snappy. This affects the HTTP Client processor and the following origins:
- Amazon S3
- Azure Blob Storage
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2 (Legacy)
- Azure IoT/Event Hub Consumer
- CoAP Server
- Directory
- File Tail
- Hadoop FS Standalone
- Google Cloud Storage
- Google Pub/Sub Subscriber
- gRPC Client
- HTTP Client
- HTTP Server
- Kafka Multitopic Consumer
- MQTT Subscriber
- REST Service
- SFTP/FTP/FTPS Client
- TCP Server
- WebSocket Client
- WebSocket Server
5.6.2 Fixed Issues
- The Snowflake and Snowflake File Uploader destinations and the Snowflake executor do not release threads after a pipeline stops.
- Orchestration stages do not retry requests after the requests time out.
5.6.1 Fixed Issues
- The Azure Synapse SQL, Databricks Delta Lake, Google BigQuery, and Snowflake destinations might generate invalid date, datetime, time, or zoned datetime formats when configured to write data to multiple tables and the Connection Pool Size is set to 0 or to a value greater than 1.
5.6.0 Fixed Issues
- The Snowflake destination does not consolidate records properly when using CDC.
- The Oracle CDC Client origin does not generate batches if only error records are generated.
- To avoid data loss when processing LogMiner records, the Oracle CDC Client origin must store all records from a time window before processing them, which consumes excessive memory in many cases.
- The Oracle CDC origin mixes operations of target pluggable databases and unrelated pluggable databases. This fix impacts upgrades.
- The time:dateAddition and time:dateDifference functions fail to validate because they expect datetime values in the LocalDateTime date format, which the StreamSets expression language cannot interpret.
- In the HTTP Client stages and the Control Hub API processor, setting the Log Level property to a level higher than Info results in no messages written to the log.
- The Control Hub API processor does not correctly log request and response data.
- Using the JDBC Multitable Consumer origin to read from an Oracle database with multiple threads results in cursor leaks.
- Previewing the Azure Data Lake Storage Gen2 origin results in a timeout and no records generated.
- The Local FS destination stops processing files upon encountering a file name with a colon.
- The SFTP/FTP/FTPS Client destination cannot write files if configured to use the SFTP protocol to connect to a remote server with a storage layer in a Google Cloud Services bucket.
- The MySQL Binary Log origin cannot process queries for table or database names with hyphens.
- Control Hub randomly provides users and engines with 401 authorization errors.
5.6.x Known Issues
- A Databricks Delta Lake destination staged on ADLS Gen2 and using OAuth 2.0 authentication can fail with a DATA_LOADING_10 error. This can occur because the destination needs to perform the Get Blob Service Properties operation, and the OAuth 2.0 account does not have the appropriate permissions.
Workaround: Grant the OAuth 2.0 account the necessary permissions to perform the Get Blob Service Properties operation. For more information, see the Microsoft documentation.
5.5.x Release Notes
The Data Collector 5.5.0 release occurred on April 28, 2023.
New Features and Enhancements
- New stages
- Azure Data Lake Storage Gen2 origin - The new Azure Data Lake Storage Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2. The new origin connects to Azure Data Lake Storage Gen2 through the API, which results in improved performance.
The existing Azure Data Lake Storage Gen2 origin has been renamed Azure Data Lake Storage Gen2 (Legacy). Use the new origin for all new development.
- Azure Blob Storage origin - The new Azure Blob Storage origin reads data from Microsoft Azure Blob Storage.
- SingleStore destination - The new SingleStore destination writes data to a SingleStore database table.
- Stage enhancements
- Azure Data Lake Storage Gen2 stages - These stages support Azure Managed Identity authentication.
- Azure Synapse SQL destination:
- The stage is no longer part of an Enterprise stage library. The stage is part of the Azure stage library, streamsets-datacollector-azure-lib. This change has an upgrade impact.
- With the new Propagate Numeric Precision and Scale property enabled, the destination creates new numeric columns with the precision and scale specified in JDBC record header attributes. With the property disabled or when JDBC header attributes are not available, the stage creates new numeric columns with the default numeric definition.
- Google Big Query destination - The destination can stage data in JSON files.
- Hive Metadata processor, Hive Metastore destination, and Hive Query executor - In these stages, the Additional Hadoop Configuration property supports credential values.
- MongoDB Atlas origin and destination - For pipelines with a MongoDB Atlas origin or destination, validation passes if at least one host specified in the connection string is reachable. Validation fails if all of the hosts specified in the connection string are unreachable.
- MongoDB Atlas destination - The destination can update or upsert a map, list, or list-map as a nested or non-nested MongoDB document, depending on the number of unique keys.
- OPC UA Client origin:
- The new Max Recursion Depth property defines the maximum depth to browse for recursive processing.
- The Channel Config tab has been renamed Encoding Limits.
- The Max Array Length and Max String Length properties have been removed because they are redundant. The existing Max Message Size property limits the size of the message. This change has an upgrade impact.
- PostgreSQL CDC Client and Aurora PostgreSQL CDC Client origins - The origins interpret the values in the Poll Interval property as milliseconds.
- Snowflake File Uploader destination - For the file-closure event, the destination writes the file name and path information to two fields: Filename and Filepath.
- SQL Server CDC Client origin - The origin generates the primary key record header attributes regardless of whether the Combine Update Records property is enabled.
- Connections
- Azure Blob Storage - This new Control Hub connection is included for use with the new Azure Blob Storage origin.
- Azure Data Lake Storage Gen2:
- You can use the connection with the new Azure Data Lake Storage Gen 2 origin.
- The connection now supports Azure Managed Identity authentication.
- Azure Synapse - Because the Azure Synapse SQL destination is no longer part of an Enterprise stage library, you must install the Azure stage library, streamsets-datacollector-azure-lib, to configure or use Azure Synapse connections. This change has an upgrade impact.
- New support
- Cloudera Manager - To support Data Collector installation with Cloudera Manager, StreamSets now provides a Cloudera parcel for RHEL 8.
- Additional enhancements
- Runtime properties - The runtime.conf.location property supports both a relative and an absolute path. When configuring a separate runtime properties file, specify a relative path for a file inside the Data Collector installation directory or specify an absolute path for a file outside the Data Collector installation directory.
- Error information - A new Error Information Level property sets the amount of error information included in email. You configure the property for notifications of changes in pipeline state and for notifications triggered by rules.
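A minimal sdc.properties sketch of the runtime.conf.location behavior described above; the file names and paths are hypothetical:

```properties
# Relative path: resolved inside the Data Collector installation directory
runtime.conf.location=runtime.properties

# Absolute path: points to a file outside the installation directory
# runtime.conf.location=/etc/sdc/runtime.properties
```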
- Health Inspector - The Health Inspector page includes information about the operating system and version, process uptime, engine time zone, external IP address, and ping and traceroute attempts to Control Hub.
- Time functions - The StreamSets expression language includes two new time functions to operate on dates:
- time:dateAddition - Adds an interval to a date.
- time:dateDifference - Determines the interval between two dates.
Upgrade Impact
- Install the Azure stage library to use the Azure Synapse SQL destination and connection
- Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.
- Review pipelines with Salesforce stages that import date values
- Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.
- Review the maximum message size for OPC UA Client pipelines
- Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.
5.5.0 Fixed Issues
- The JDBC Producer destination does not update the content of a column that was added by the PostgreSQL Metadata processor.
- The MongoDB Atlas destination does not write StreamSets Map and List fields as expected for update and upsert operations.
- The Oracle CDC origin does not properly handle null and empty values when converting from hex to target data types.
- The Oracle CDC origin provides negative values for some summary counters.
- The Oracle CDC origin does not correctly process time zones expressed as UTC Offset Standard Time, such as +05:00 or -07:00.
- The Oracle CDC origin considers a column to be a pseudocolumn if its name matches a documented pseudocolumn, even if the table definition includes the column.
- The Oracle CDC origin converts Oracle Date columns to the Data Collector Date data type instead of Datetime.
- The Oracle CDC Client origin reads data slower when using continuous mining due to changes in the caching algorithm.
- The Oracle CDC Client origin fails to parse LOB_WRITE, LOB_TRIM, and LOB_ERASE records that contain Blob or Clob fields when the Use PEG Parser property is enabled.
- The Salesforce stages import date values as strings. This fix has an upgrade impact.
5.5.x Known Issues
- The time:dateAddition and time:dateDifference functions fail to validate because they expect datetime values in the LocalDateTime date format, which the StreamSets expression language cannot interpret.
5.4.x Release Notes
The Data Collector 5.4.0 release occurred on February 28, 2023.
New Features and Enhancements
- Oracle CDC support
- You can use a new Oracle CDC origin to process change data from Oracle redo logs. Like the original Oracle CDC Client origin, the new Oracle CDC origin uses LogMiner to access online or archived redo logs.
- Stage enhancements
- Amazon stages - Amazon stages include updated regions in the AWS Region property.
- Azure Data Lake Storage Gen2 stages - You can configure the new Endpoint URL property when using OAuth with Service Principal authentication with the Azure Data Lake Storage Gen2 origin and destination, and the ADLS Gen2 File Metadata executor.
- JDBC Multitable Consumer origin - You can configure the new Maximum Number of Tables property to limit the number of tables to prefetch.
- Google BigQuery destination - The destination allows enabling the Create Table property only when it is configured to handle data drift. This reverts a change in 5.3.0 that allowed creating tables when the destination was not configured to handle data drift.
- JDBC Tee processor and JDBC Producer - You can use a new useLegacyZonedDatetime JDBC configuration property to help with MySQL driver upgrade issues. MySQL driver versions 8.0.23 and later return zoned datetimes in a different format. If you upgrade from an older MySQL driver to 8.0.23 or later, you can add useLegacyZonedDatetime as an Additional JDBC Configuration property and set it to true to have the stages provide zoned datetimes in the previous format.
- MongoDB Atlas stages - You can configure the UUID Interpretation Mode advanced property to specify how a MongoDB Atlas stage handles UUID fields.
- Orchestration stages - Orchestration stages that connect to Control Hub include new Max Number Of Tries and Retry Interval properties that determine how the stages try to connect to Control Hub after encountering communication errors.
- Salesforce stages - All Salesforce stages now use version 57.0.0 of the Salesforce API by default.
- Snowflake stages - The following enhancements apply to the Snowflake and Snowflake File Uploader destinations, and the Snowflake executor:
- Snowflake stages are no longer part of an Enterprise stage library. The stages are now available in the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib. This change has upgrade impact.
- When specifying an organization on the Snowflake Connection Info tab, you no longer need to specify a Snowflake region.
- Snowflake stages can access all of the latest regions for AWS, GCP, and Azure.
- Snowflake destination:
- The destination can automatically create tables when configured to handle data drift or use Snowpipe to load data. Previously, it did not create tables when using Snowpipe to load data.
- When you configure the Snowflake destination to use an Amazon S3 staging location, you no longer specify the S3 region. Data Collector now queries Snowflake for that information.
- Snowflake executor - You can use the new Warehouse property to define the warehouse to connect to.
- SQL Parser processor - You can configure the Parsing Thread Pool Size property to enable the processor to use multiple threads when processing data.
- Support
- PostgreSQL 15.x support - You can use the PostgreSQL CDC Client origin and JDBC stages with PostgreSQL 15.x.
- Red Hat Enterprise Linux support - You can install Data Collector on Red Hat Enterprise Linux version 9.x, in addition to 6.x - 8.x.
- New stage libraries:
- streamsets-datacollector-apache-kafka_3_3-lib - For Kafka version 3.3.x.
- streamsets-datacollector-sdc-snowflake-lib - For Snowflake.
- Additional enhancements
- New function - A new pipeline:email() function returns the email address of the user who started the pipeline.
- Data governance tools - Data governance tools can now publish metadata about the Kinesis Consumer origin.
- Java Security Manager - Data Collector can use the Java Security Manager only when using Java 8. Oracle has deprecated the Java Security Manager and marked it for removal. As a result, when using Java 9 or later, Data Collector cannot use the security manager.
Previous releases enabled the Java Security Manager by default for all Java versions, which in some cases caused known issues when using Java 9 or later. To avoid those issues, you had to disable the security manager by setting the SDC_SECURITY_MANAGER_ENABLED environment variable to false.
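As an illustration, the pipeline:email() function can be used anywhere the expression language is accepted, such as an Expression Evaluator output field. The output field name /started_by is hypothetical:

```
# Expression returning the email address of the user who started the pipeline
${pipeline:email()}

# For example, in an Expression Evaluator:
#   Output Field:     /started_by
#   Field Expression: ${pipeline:email()}
```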
Upgrade Impact
- Install the Snowflake stage library to use Snowflake stages and connections
- Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.
5.4.0 Fixed Issues
- When using JDBC stages with MySQL driver versions 8.0.16 and later, the stages can encounter data type conversion errors that cause an entire batch to be treated as error records instead of the individual records with the problem.
- The Oracle CDC Client origin can generate a null pointer exception when processing user-defined columns whose data type is ROWID. Oracle pseudocolumns are not affected by this issue.
- The Oracle CDC Client origin can generate a null pointer exception when the pipeline stops when the origin is processing error records.
- When the Oracle CDC Client origin scans redo logs to check for session integrity issues within a LogMiner window, current offset handling can cause it to lose data if missing changes appear in an unexpected order.
- The Oracle CDC Client origin generates Update records for Inserts that include Blob or Clob fields.
- Data Collector fails to start when the Java installation package version includes only the major version, without a specified minor version. For example, a version packaged as Java 11 can cause Data Collector to fail to start, but Data Collector starts as expected with a Java 11.5 package.
- The PostgreSQL CDC origin generates a null pointer exception when processing null values in numeric fields.
- The Snowflake destination does not perform case-sensitive evaluation of primary keys or properly honor the Upper Case Schema and Field Names property.
- The JDBC Multitable Consumer origin can generate a null pointer exception when using multiple threads.
- When upgrading from Data Collector 5.2.0 or earlier to 5.3.0, JDBC Multitable Consumer pipelines do not upgrade properly when the Number of Threads property is defined using an expression.
- The Cassandra destination includes two Write Timeout properties, instead of a Write Timeout property and a Socket Read Timeout property.
- Kafka message headers are not available when using Kafka Java client version 0.11, even though the library supports them.
5.4.x Known Issues
- The Oracle CDC origin converts Oracle Date columns to the Data Collector Date data type instead of Datetime.
Workaround: Use a Field Type Converter processor to convert the Date field to Datetime.
5.3.x Release Notes
The Data Collector 5.3.0 release occurred on December 2, 2022.
New Features and Enhancements
- Java 11 and 17 support
- With this release, Data Collector supports Java 11 and 17 in addition to Java 8. Due to third-party requirements, some Data Collector features require a particular Java version. For more information, see Java Versions and Available Features.
- Stage enhancements
-
- Amazon S3 executor - The executor supports server-side encryption.
- Dev Data Generator origin - This development origin can generate field attributes to enable easily testing field attribute functionality in pipelines.
- Directory and HDFS Standalone origins - The Directory and HDFS Standalone origins include a new Ignore Temporary Files property that enables the origins to skip processing Data Collector temporary files with a _tmp_ prefix.
- Field Flattener processor - The Output Type property allows you to choose the root field type for flattened records: Map or List-Map.
- Field Mapper processor - The Create New Paths property allows the processor to create new paths when changing record structures.
- Field Replacer processor - The Field Does Not Exist property
includes the following new options:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
These new options replace the Include without Processing option. Upgraded pipelines are set to Add New Field. This can have upgrade impact.
- Google BigQuery destination and executor - These stages are no
longer part of an Enterprise stage library. This includes the
following updates:
- The stages, previously known as the Google BigQuery (Enterprise) destination and the Google BigQuery (Enterprise) executor, are renamed to the Google BigQuery destination and Google BigQuery executor.
- The stages are now available in the Google
Cloud stage library,
streamsets-datacollector-google-cloud-lib
, and are available to install like any other Data Collector stage. This change has upgrade impact.
- Google BigQuery destination:
- The destination can write nested Avro data.
- The destination can use BigQuery to generate schemas.
- The destination now allows enabling the Create Table property when the destination is not configured to handle data drift.
- JDBC Multitable Consumer origin - You can no longer set the Minimum
Idle Connections property higher than the Number of Threads
property.
Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property. This can have upgrade impact.
- Kafka message header support - The following functionality is
available when using a Kafka Java client version 0.11 or later:
- The Kafka Multitopic Consumer origin includes Kafka message headers as record header attributes.
-
The Kafka Producer destination includes all user-defined record header attributes as Kafka message headers when writing to Kafka.
- MongoDB Atlas destination - The destination can update nested documents.
- OPC UA Client origin - The origin includes a new Override Host property which overrides the host name returned from the OPC UA server with the host name configured in the resource URL.
- Oracle CDC Client origin - The minimum values for the following advanced properties have changed from 1 millisecond to 0 milliseconds:
- Time between Session Windows
- Time after Session Window Start
- Start Job origin and processor - The Search Mode property enables the origin and processor to search for the Control Hub job to start.
- Connections
-
- Google BigQuery update - Due to the BigQuery change from enterprise,
you must install the Google Cloud stage library,
streamsets-datacollector-google-cloud-lib
, to configure or use Google BigQuery connections. This change has upgrade impact.
- Stage libraries
-
This release includes the following new stage libraries:
- streamsets-datacollector-mapr_7_0-lib - For MapR 7.0.x.
- streamsets-datacollector-mapr_7_0-mep8-lib - For MapR 7.0.x with MEP 8.x.
- Additional functionality
-
- Data governance tools - Data governance tools support
publishing metadata about the following additional stages:
- HTTP Client origin
- JDBC Multitable Consumer origin
- Advanced data format properties - Optional properties for data formats have become advanced options. You must now view advanced options to configure them.
- Delimited data format enhancements:
- When writing delimited data using a destination or the HTTP Client processor, you can define properties to enable writing multicharacter delimited data.
- When writing delimited data using a custom delimiter format, you can configure the Record Separator String property to define a custom record separator.
- Pipeline start events - Pipeline start event records include a new
email
field that contains the email address of the user who started the pipeline.
Upgrade Impact
- Install the Google Cloud stage library to use BigQuery stages and connections
- Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.
- Review minimum idle connections for JDBC Multitable Consumer origins
- Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.
- Review missing field behavior for Field Replacer processors
- Starting with version 5.3.0, the advanced Field Does Not Exist property in
the Field Replacer processor has the following two new options that replace
the Include without Processing option:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
- Review runtime:loadResource pipelines
- Starting with version 5.3.0, pipelines that include the
runtime:loadResource
function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.
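As a sketch of the affected call pattern, a stage property might load a resource file with an expression like the following, where the file name is illustrative and the second argument indicates whether the file must be restricted to owner-only permissions:

```
${runtime:loadResource("credentials.txt", true)}
```

With 5.3.0 behavior, this expression fails the pipeline with an error if credentials.txt is missing or empty, rather than failing silently.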
5.3.0 Fixed Issues
-
To avoid the Text4Shell vulnerability, Data Collector 5.3.0 is packaged with version 1.10.0 of the Apache Commons Text library.
-
When used in some locations, the runtime:loadResource function can silently fail and stop the pipeline when trying to load an empty or missing resource file, giving no indication of the problem. With this fix, when failing to load an empty or missing resource file, the runtime:loadResource function generates an error that stops the pipeline. This fix has upgrade impact.
-
Pipelines cannot be deleted when Data Collector uses Network File System (NFS).
-
When a query fails to produce results, the Elasticsearch origin stops when the network socket times out. With this fix, the origin continues retrying the query until the cursor expires.
-
The Email executor does not work when using certain providers, such as SMTP providers, due to a change in the version of a file.
-
The SQL Server CDC Client and SQL Server Change Tracking origins fail to function properly in a Control Hub fragment.
- When the Oracle CDC Client origin is not configured to process Blob or Clob columns, the origin includes Blob and Clob field names in the record with either null values or raw string values depending on whether the Unsupported Fields to Records property is enabled.
- The
java.security.networkaddress.cache.ttl
Data Collector configuration property does not cache Domain Name Service (DNS) lookups as expected.
5.3.x Known Issues
- Data Collector fails to start when the Java installation package version includes only the
major version, without a specified minor version. For example, a version
packaged as Java 11 can cause Data Collector to fail to start, but Data Collector starts as expected with a Java 11.5 package.
Workaround: Upgrade to Data Collector 5.4.0 or later, where this issue is fixed. Or, use a Java installation package with a minor version.
You can check the Java installation package version installed on a machine by running the following command:
java --version
- When upgrading from Data Collector 5.2.0 or earlier to 5.3.0, JDBC Multitable
Consumer pipelines do not upgrade properly when the Number of Threads property
is defined using an expression.
Workaround: Upgrade to Data Collector 5.4.0 or later, where this issue is fixed. Or, replace the expression in the Number of Threads property with a static value.
- Kafka message headers are not available when using Kafka Java client version 0.11, even though the library supports them.
5.2.x Release Notes
The Data Collector 5.2.0 release occurred on September 29, 2022.
New Features and Enhancements
- New stages
-
- MongoDB Atlas origin and destination - You can use the new MongoDB Atlas origin and destination to read from and write to MongoDB Atlas and MongoDB Enterprise Server.
- Stage enhancements
-
- Groovy stages - The Groovy Scripting origin and the Groovy Evaluator processor now support Groovy 4.0.
- JDBC Multitable Consumer origin - The origin now provides the jdbc.primaryKeySpecification record header attribute for records from tables with a primary key, and the jdbc.vendor record header attribute for all records.
- JDBC Tee processor and JDBC Producer destination - These stages can manage primary key value updates using the jdbc.primaryKey.before.columnName record header attribute for the old value and the jdbc.primaryKey.after.columnName record header attribute for the new value.
- MQTT stages - The MQTT Subscriber origin and the MQTT Publisher destination now support entering a list of brokers from high availability MQTT clusters without a load balancer.
- Oracle CDC Client origin:
- Conditional Blob and Clob support - When the
origin buffers changes locally, you can configure the origin
to process Blob and Clob data using the following advanced
properties:
- Enable Blob and Clob Columns Processing property - Enable this property to process Blob and Clob columns.
- Maximum LOB Size property - Optional property to define the maximum LOB size. When specified, overflow data is discarded.
- LogMiner Query Timeout property - This property defines how long the origin waits for a LogMiner query to complete.
- Time between Session Windows property - This advanced property sets the time to wait after a LogMiner session has been completely ingested. This ensures a minimum LogMiner window size.
- Time after Session Window Start property - This advanced property sets the time to wait after a LogMiner session starts. This allows Oracle to finish setting up before processing begins.
- Pipeline Finisher executor - The executor includes new React to Events and Event Type properties that enable the executor to stop a pipeline only upon receiving the specified event record type.
For example, you can now configure the executor to stop the pipeline only after receiving a no-more-data event record, and to ignore all other records that it might receive. Previously, you might have used a precondition or a Filter processor to ensure that the executor received only no-more-data events.
- Snowflake stages - Snowflake stages have been updated to support all Snowflake regions.
- SQL Server CDC Client origin - The origin can be configured to combine the two update records that SQL Server creates for each update into a single record.
With this property enabled, the origin generates record header attributes about the primary key.
- SQL Server Change Tracking origin - The origin generates record header attributes about the primary key.
- Connections when registered with Control Hub
- When Data Collector version 5.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- Stage libraries
-
This release includes the following new stage libraries:
- streamsets-datacollector-groovy_4_0-lib
- streamsets-datacollector-mongodb_atlas-lib
- Additional functionality
-
- New Data Collector configuration property - To cache Domain Name Service (DNS) lookups, you can use the new networkaddress.cache.ttl property in the $SDC_DIST/etc/sdc-java-security.properties file. With this change, the java.security.networkaddress.cache.ttl Data Collector property has been deprecated.
- Help options - The Local Help option has been removed from the Help configuration option in Help > Settings. When you view Data Collector help, you will always view it on the StreamSets documentation website: https://docs.streamsets.com.
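For example, to cache successful DNS lookups for 60 seconds, you might add the following line to the security properties file; the TTL value is illustrative:

```
# $SDC_DIST/etc/sdc-java-security.properties
# Cache successful DNS lookups for 60 seconds (illustrative value)
networkaddress.cache.ttl=60
```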
Upgrade Impact
- Review MySQL Binary Log pipelines
With 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.
In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields are converted to Long, and Set fields are converted to Integer. In 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.
After you upgrade to 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
- Review Oracle CDC Client pipelines
With 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.
In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values in records.
Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values in records. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.
5.2.0 Fixed Issues
- The MySQL Binary Log origin converts Enum and Set fields to different field types based on how the binlog_row_metadata database property is set. This fix has upgrade impact.
-
The Include Deleted Records property in the Salesforce Lookup processor does not display.
-
The Salesforce Bulk API destination can encounter problems when generating error records.
- Pipeline parameters do not work properly with required list properties.
5.2.x Known Issues
- The java.security.networkaddress.cache.ttl Data Collector configuration property does not cache Domain Name Service (DNS) lookups as expected.
Workaround: Use the new networkaddress.cache.ttl property in the $SDC_DIST/etc/sdc-java-security.properties file.
- When the Oracle CDC Client origin is not configured to process Blob or Clob columns, the origin includes Blob and Clob field names in the record with either null values or raw string values depending on whether the Unsupported Fields to Records property is enabled. This issue has upgrade impact.
5.1.x Release Notes
The Data Collector 5.1.0 release occurred on July 28, 2022.
New Features and Enhancements
- New stage
-
- Pulsar Consumer origin - The new Pulsar Consumer origin
can use multiple threads to read from Pulsar. The origin supports
schema validation and Pulsar namespaces configured to enforce schema
validation. You can specify the schema used to determine
compatibility between the origin and a Pulsar topic. You can also
use JWT authentication with the new origin.
With this new origin, the existing Pulsar Consumer has been renamed Pulsar Consumer (Legacy). Use this new Pulsar origin for all new development.
- Stage enhancements
-
- Aurora
PostgreSQL CDC Client and PostgreSQL CDC
Client origins - Both origins can now generate a record
for each individual operation.
Previously, the origins could only generate a record for each transaction.
- Aurora
PostgreSQL CDC Client, PostgreSQL CDC
Client, and MySQL Binary
Log origins - These origins include the following new
record header attributes when a table includes a primary key:
jdbc.primaryKeySpecification
- Includes a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.jdbc.primaryKey.before.<primary key column>
- Includes the previous value for the specified primary key column.jdbc.primaryKey.after.<primary key column>
- Includes the new value for the specified primary key column.
- Kafka Multitopic Consumer origin and Kafka Producer destination - The stages include a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
- OPC UA Server origin - The origin now supports using a user name and password to authenticate with the OPC UA server, in addition to an anonymous log in.
- Oracle CDC Client origin:
- The origin now supports reading from Oracle 21c databases.
- The field order of generated records now matches the column order in database tables. Previously, the field order was not guaranteed.
- When you configure the origin to use local buffers and write to disk, you can specify an existing directory to use.
- A new Data Collector configuration property affects the origin and can have upgrade impact. For details, see Upgrade Impact.
- Pulsar Consumer (Legacy) origin:
- The origin, formerly named Pulsar Consumer, has been renamed
with this release.
This change has no upgrade impact. However, we recommend using the new Pulsar Consumer origin, which supports multithreaded processing, to read from Pulsar.
- You can specify the schema used to determine compatibility between the origin and a Pulsar topic.
- You can also use JWT authentication with the origin.
- The origin, formerly named Pulsar Consumer, has been renamed
with this release.
- Salesforce Bulk API 2.0 stages:
- All Salesforce Bulk API 2.0 stages include a new Salesforce Query Timeout property which defines the number of seconds that the stage waits for a response to a query.
- The Salesforce Bulk API 2.0 origin and Salesforce Bulk API 2.0 Lookup processor both include a new Maximum Query Columns property that limits the number of columns that can be returned by a query.
- Scripting stages - The Groovy, JavaScript, and Jython Evaluator origins and processors now generate metrics for script execution and locking details that you can view when monitoring the pipeline.
- Field Remover processor - The processor now includes the On Record Error property on the General tab.
- Field Type Converter processor - When converting a Date, Datetime,
or Time field, the Date Format property now offers explicit options
to specify that the field contains a Unix timestamp in milliseconds
or seconds.
If the field contains a Unix timestamp and you select an alternate date format, then the behavior is unchanged: the processor assumes the timestamps are in milliseconds.
- SQL Parser processor:
- The processor adds the fields from the SQL statement in the same order as the corresponding columns in the database tables.
- The processor now includes field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information for each field.
- Pulsar Producer destination - You can now use JWT authentication with the destination.
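The jdbc.primaryKey.before.<primary key column> and jdbc.primaryKey.after.<primary key column> record header attributes described in this section can be read downstream with the record:attribute expression function. A sketch, using an illustrative column name:

```
${record:attribute('jdbc.primaryKey.before.CUSTOMER_ID')}
```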
- Connection enhancements
-
- Kafka connection - The connection includes a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
- OPC UA connection - The connection now supports using a user name and password to authenticate with the OPC UA server in addition to an anonymous log in to the server.
- Pulsar connection - The connection can now use JWT authentication to connect to Pulsar.
- Snowflake connection - The new Connection Properties property enables you to specify additional connection properties for Snowflake connections.
- Enterprise stage libraries
- In August 2022, StreamSets released the following Enterprise stage libraries:
- Azure Synapse 1.3.0
- Databricks 1.7.0
- Snowflake 1.12.0
- Additional enhancements
-
- Data Collector Docker image - The Docker image for Data Collector 5.1.0, streamsets/datacollector:5.1.0, uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. This change can have upgrade impact.
- Microsoft JDBC Driver for SQL Server - Data Collector uses version 10.2.1 of the driver to connect to Microsoft SQL Server. Due to changes in the driver, this can have upgrade impact.
- Data Collector configuration properties - You can define the
following new configuration properties:
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL
- When this configuration property is set totrue
, Data Collector attempts to disable SSL for all JDBC connections. This property is commented out by default.stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize
- When this configuration property is set totrue
, Data Collector reports memory consumption when the Oracle CDC Client origin uses local buffers.This property is set to
false
by default. In previous releases, the origin reported this information by default, so this enhancement has upgrade impact.
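A sketch of how these configuration properties might appear in the Data Collector configuration file; both values are illustrative:

```
# $SDC_DIST/etc/sdc.properties
# Attempt to disable SSL for all JDBC connections (commented out by default)
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL=true
# Report memory consumption when the Oracle CDC Client origin uses local buffers
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize=true
```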
- Stage libraries
- This release includes the following new stage libraries:
- streamsets-datacollector-apache-kafka_3_0-lib - For Apache Kafka 3.0.
- streamsets-datacollector-apache-kafka_3_1-lib - For Apache Kafka 3.1.
- streamsets-datacollector-apache-kafka_3_2-lib - For Apache Kafka 3.2.
Upgrade Impact
- Review SQL Server pipelines without SSL/TLS encrypted connections
- With 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version has introduced a breaking backward-incompatible change.
- Review reporting requirements for Oracle CDC Client pipelines
- With 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.
- Review Dockerfiles for custom Docker images
- In previous releases, the Data Collector Docker image used Alpine Linux as a parent image. Due to limitations in Alpine Linux, with this release the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image.
5.1.0 Fixed Issues
- When the Oracle CDC Client origin is configured to use the PEG Parser for processing, the jdbc.primaryKey.before.<primary key column> and jdbc.primaryKey.after.<primary key column> record header attributes are not set correctly.
- If an Oracle RAC node is force stopped, the Oracle CDC Client origin can stop producing records even though it is still mining through a LogMiner session. Instead of retrying until the system stabilizes, the origin needs to create a new LogMiner session. This issue is related to the internal tasks that Oracle runs after restarting a crashed node.
- Data loss can occur when the Oracle CDC Client origin does not use local buffering and the pipeline is stopped while a transaction that contains multiple operations and spans several seconds is being processed.
- Data Collector can fail to load a PostgreSQL driver correctly when you have pipelines that use different stages to access PostgreSQL.
- The Maximum Parallel Requests property in the HTTP Client processor and
destination does not work as expected.
This property was removed from the processor and destination because these stages do not support parallel requests.
- The MySQL Binary Log origin can fail when the order of columns in a source table
changes.
This issue is fixed when using the origin from Data Collector version 5.1.0 or later to read from MySQL 8.0 or later. However, you must set the
binlog_row_metadata
MySQL configuration property toFULL
. - The MySQL Binary Log origin can stall and stop processing due to a problem with an internal queue. If you attempt to stop the pipeline at that time, the pipeline can become non-responsive.
- When registered with Control Hub, the SFTP/FTP/FTPS connection does not include private key properties that enable configuring the connection to use private key authentication.
5.1.x Known Issues
There are no important known issues at this time.
5.0.x Release Notes
The Data Collector 5.0.0 release occurred on April 29, 2022.
New Features and Enhancements
- New stages
-
- Aurora PostgreSQL CDC Client origin - Use the origin to process Write-Ahead Logging (WAL) data to generate change data capture records for an Amazon Aurora PostgreSQL database.
- Salesforce Bulk API 2.0 origin - Reads from Salesforce using Salesforce Bulk API 2.0.
- Salesforce Bulk API 2.0 Lookup processor - Performs lookups on Salesforce data using Salesforce Bulk API 2.0.
- Salesforce Bulk API 2.0 destination - Writes to Salesforce using Salesforce Bulk API 2.0.
- Updated stages
-
- HTTP Client
enhancements - When using OAuth2 authentication with HTTP
Client stages, you can configure the following new properties:
- Use Custom Assertion and Assertion Key Type - Use these properties to specify a custom parameter for passing the JSON Web Token (JWT).
- JWT Headers - Use to specify headers to include in the JWT.
- JMS Producer destination - You can configure the destination
to remove the
jms.header
prefix from record header attribute names before including the information as headers in the JMS messages. - Kafka
Multitopic Consumer origin - The origin includes the
following new properties:
- Topic Subscription Type and Topic Pattern - Use these two properties to specify a regular expression that defines the topic names to read from, instead of simply listing the topic names.
- Metadata Refresh Time - Specify the milliseconds to wait before checking for additional topics that match the regular expression.
- Oracle CDC Client
origin - When a table includes a primary key, the origin
includes the following new record header attributes:
jdbc.primaryKeySpecification
- Includes a JSON-formatted string with all primary keys in the table and related metadata. For example: jdbc.primaryKeySpecification = {"<primary key name>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true}, "<primary key name 2>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true}}
jdbc.primaryKey.before.<primary key column>
- Includes the previous value for the specified primary key column.jdbc.primaryKey.after.<primary key column>
- Includes the new value for the specified primary key column.
- Pulsar Producer destination - Use the new Schema tab to specify the schema that Pulsar uses to validate the messages that the destination writes to a topic.
- Salesforce stages - All Salesforce stages now use version 54.0 of the Salesforce API by default.
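Because jdbc.primaryKeySpecification holds a JSON-formatted string, a downstream consumer can parse it with any JSON library. A minimal sketch in Python, using the example value from the release note above with quotes normalized and an illustrative key name:

```python
import json

# Example jdbc.primaryKeySpecification value, with typographic quotes
# normalized to straight quotes and an illustrative primary key name.
spec = ('{"ORDER_ID": {"type": 2, "datatype": "VARCHAR", "size": 39, '
        '"precision": 0, "scale": -127, "signed": true, "currency": true}}')

# Parse the attribute value and list each primary key column with its metadata.
primary_keys = json.loads(spec)
for column, meta in primary_keys.items():
    print(column, meta["datatype"], meta["size"])  # ORDER_ID VARCHAR 39
```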
- Connections when registered with Control Hub
-
- When Data Collector version 5.0.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- Aurora PostgreSQL CDC Client origin
-
Azure Synapse SQL destination
- Google BigQuery executor
- Hive stages
- OPC UA Client origin
- Salesforce Bulk API 2.0 stages
- Enterprise stage libraries
- In May 2022, StreamSets released the following Enterprise stage libraries:
- Azure Synapse 1.2.0
- Databricks 1.6.0
- Google 1.1.0
- Oracle 1.4.0
- Snowflake 1.11.0
- Additional enhancements
-
- Data Collector logs - Data Collector uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Data Collector used the Apache Log4j 1.x library which is now end-of-life. This can have upgrade impact.
- Elasticsearch 8.0 support - You can now use Elasticsearch stages to read from and write to Elasticsearch 8.0.
- Credential stores property - A new credentialStores.usePortableGroups credential stores property enables migrating pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.
Upgrade Impact
- Data Collector log configuration
- With 5.0.0 and later, Data Collector uses the Apache Log4j 2.17.2 library to write log data. Data Collector includes the following log configuration files:
- sdc-log4j2.properties
- log4j2.component.properties
- Update Oracle CDC Client origin user accounts
- With 5.0.0 and later, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.
5.0.0 Fixed Issues
- The Oracle CDC Client origin can fail if redo logs are rotated as the origin
reads data from the current log. The origin can also fail when an Oracle RAC
node fails or recovers from a failure or planned shut down.
With this fix, the Oracle CDC Client origin handles these and other recovery and maintenance scenarios more reliably and efficiently. However, the fix requires configuring additional permissions for the Oracle user. For more information, see Upgrade Impact.
- The Oracle CDC Client origin treats the underscore character ( _ ) as a single-character wildcard in schema names and table name patterns, disallowing the valid use of the character as a literal underscore.
  With this fix, you can use the character as a literal underscore by escaping it with a slash character ( / ). For example, to specify the NA_SALES table, you enter NA/_SALES.
- Oracle CDC Client origin pipelines fail with null pointer exceptions when the origin is configured to buffer data locally to disk instead of in memory.
- When the Oracle CDC Client origin Convert Timestamp to String advanced property is enabled, the origin does not properly handle unparsable timestamps.
- JDBC stages that read data, such as the JDBC Query Consumer origin or the JDBC Lookup processor, do not generate records after one of the JDBC stages encounters an error reading a table column.
- The JDBC Query Consumer origin incorrectly generates a no-more-data event when the limit in a query matches the configured max batch size.
- The JDBC Query Consumer origin is unable to read Oracle data of the Timestamp with Local Time Zone data type.
5.0.x Known Issues
There are no important known issues at this time.
4.4.x Release Notes
- 4.4.1 on March 24, 2022
- 4.4.0 on February 16, 2022
New Features and Enhancements
- Updated stages
-
- Amazon S3 stages - You can use an Amazon S3 stage to connect to Amazon S3 using a custom endpoint.
- Amazon S3 destination - You can configure the destination to add tags to the Amazon S3 objects that it creates.
- Base 64 Field Decoder and Encoder processors - You can configure the processors to decode or encode multiple fields.
- Google BigQuery (Legacy) destination - The destination, formerly called Google BigQuery, has been renamed and deprecated with this release. The destination may be removed in a future release. We recommend that you use the Google BigQuery destination to write data to Google BigQuery, which supports processing CDC data and handling data drift.
- Hive Query executor - You can use time functions in the SQL queries that execute on Hive or Impala. When using time functions, you can also select the time zone that the executor uses to evaluate the functions.
- HTTP Client stages - You can configure additional security headers to include in the HTTP requests made by the stage. Use additional security headers when you want to include sensitive information, such as user names or passwords, in an HTTP header. For example, you might use the credential:get() function in an additional security header to retrieve a password stored securely in a credential store.
- HTTP Client processor - You can configure the processor to send a single request that contains all records in the batch.
- JMS Producer destination - You can configure the destination to include record header attributes with a jms.header prefix as JMS message headers.
- Pulsar stages - You can configure a Pulsar stage to use OAuth 2.0 authentication to connect to an Apache Pulsar cluster.
- Pulsar Consumer (Legacy) origin - The origin creates a pulsar.topic record header attribute that includes the topic that the message was read from.
- Salesforce stages - Salesforce stages now use version 53.1.0 of the Salesforce API by default.
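The credential:get() function mentioned for the HTTP Client additional security headers uses the standard Data Collector credential expression syntax. A sketch of a header value, where the header name, store ID, group, and secret name are all placeholders:

```
X-App-Token: ${credential:get("jks", "all", "http-api-token")}
```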
- Connections when registered with Control Hub
-
- When Data Collector version 4.4.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- CoAP Client destination
- Influx DB destination
- Influx DB 2.x destination
- Pulsar stages
- Amazon S3 enhancement - The Amazon S3 connection supports connecting to Amazon S3 using a custom endpoint.
- Credential stores
-
- Google Secret Manager - You can configure Data Collector to authenticate with Google Secret Manager using credentials in a Google Cloud service account credentials JSON file.
- Enterprise stage library
-
In February 2022, StreamSets released an updated Snowflake Enterprise stage library.
- Data Collector Edge
- You can no longer download the Data Collector Edge executable from Data Collector. You can download the Data Collector Edge executable from the StreamSets Support portal.
Upgrade Impact
- Encryption JAR file removed from Couchbase stage library
- With Data Collector 4.4.0 and later, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
4.4.1 Fixed Issues
- In Data Collector version 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
- When a Kubernetes pod that contains Data Collector shuts down while a pipeline that includes a MapR FS File Metadata or HDFS File Metadata executor is running, the executor cannot always perform the configured tasks.
- Access to Control Hub through the Data Collector user interface times out.
Though this fix may have resolved the issue, as a best practice, use Control Hub to author pipelines instead of Data Collector.
4.4.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.17.0 and earlier 2.x versions, Data Collector 4.4.0 is packaged with Log4j 2.17.1. This is the latest available Log4j version, and contains fixes for all known issues.
- The Oracle CDC Client origin does not correctly handle a daylight saving time change when configured to use a database time zone that uses daylight saving time.
- The MapR DB CDC origin does not properly handle records with null values.
- The Kafka Multitopic Consumer origin does not respect the configured Max Batch Wait Time.
- A state notification webhook always uses the POST request method, even if configured to use a different request method.
- When the HTTP Client origin uses OAuth authentication and the request returns 401 Unauthorized and 403 Forbidden statuses, the origin generates a new OAuth token indefinitely.
- The MapR DB CDC origin incorrectly updates the offset during pipeline preview.
- When Amazon stages are configured to assume another role and configured to connect to an endpoint, the stages do not redirect to the correct URL.
- JDBC origins encounter an exception when reading data with an incorrect date format, instead of processing the record as an error record.
- The Directory origin skips reading files that have the same timestamp.
- The JDBC Multitable Consumer origin cannot use a wildcard character (%) in the Schema property.
- The Azure Data Lake Storage Gen2 and Local FS destinations do not correctly shut down threads.
4.4.x Known Issue
- In Data Collector 4.4.0, the HTTP Client processor cannot write HTTP response data to an
existing field. Earlier Data Collector versions are not affected by this issue.
Workaround: If using Data Collector 4.4.0, upgrade to Data Collector 4.4.1, where this issue is fixed.
4.3.x Release Notes
The Data Collector 4.3.0 release occurred on January 13, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming Control Hub feature in the StreamSets platform.
4.3.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.3.0 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
4.3.x Known Issues
There are no important known issues at this time.
4.2.x Release Notes
- 4.2.1 on December 23, 2021
- 4.2.0 on November 9, 2021
New Features and Enhancements
- New support
-
- Red Hat Enterprise Linux 8.x - Data Collector now supports installation on RHEL 8.x, in addition to 6.x and 7.x.
- New stage
-
- InfluxDB 2.x destination - Use the destination to write to InfluxDB 2.x databases.
- Updated stages
-
- Couchbase Lookup processor property name updates - For clarity, the
following property names have been changed:
- Property Name is now Sub-Document Path.
- Return Properties is now Return Sub-Documents.
- SDC Field is now Output Field.
- When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
- When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
- Einstein Analytics destination enhancements:
- The Einstein Analytics destination has been renamed the Tableau CRM destination to match the Salesforce rebranding.
- The new Tableau CRM destination can perform automatic recovery.
- HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
- PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
- Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
- Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
- SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and Socks proxies.
- Connections when registered with Control Hub
-
- When Data Collector version 4.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stage supports using Control Hub connections:
- Cassandra destination
- SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
- Additional enhancements
-
- Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, $SDC_RESOURCES, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
- Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user's group against a comma-separated list of groups allowed to access each secret.
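Assuming the standard HTTPS properties in the Data Collector configuration file, a relative keystore path might then be configured as follows; treat the property names and file names as a sketch:

```properties
# keystore.jks stored in $SDC_RESOURCES; the path is relative to that directory
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
```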
- Testing update
- With this release, StreamSets no longer tests Data Collector against Cloudera CDH 5.x, which has been deprecated.
Upgrade Impact
- Enabling HTTPS for Data Collector
- With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, $SDC_RESOURCES. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration file.
- Tableau CRM destination write behavior change
- The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.
4.2.1 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
- The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as 4.1.x.
4.2.0 Fixed Issues
- Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
- The Oracle CDC Client origin can miss recovering transactional data when the pipeline unexpectedly stops when the origin is processing overlapping transactions.
- The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
- The MongoDB destination cannot write null values to MongoDB.
- The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
- Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
- The MapR DB CDC origin does not properly handle records with deleted fields.
- When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
- The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.
4.2.x Known Issues
There are no important known issues at this time.
4.1.x Release Notes
The Data Collector 4.1.0 release occurred on August 18, 2021.
New Features and Enhancements
- Use the StreamSets platform to access Data Collector
- Existing customers can continue to access Data Collector downloads using the StreamSets Support Portal.
- New stage
-
- Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
- Stage type enhancements
-
- Amazon stages - When you configure the Region property, you can select from several additional regions.
- Kudu stages - The default value for the Maximum Number
of Worker Threads property is now 2. Previously, the default was 0,
which used the Kudu default.
Existing pipelines are not affected by this change.
- Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
- Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
- Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
- Origin enhancements
-
- Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
- MySQL Binary Log origin - The origin now recovers automatically from
the following issues:
- Lost, damaged, or unestablished connections.
- Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
- Oracle CDC Client origin:
- The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
- The origin provides additional LogMiner metrics when you monitor a pipeline.
- RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
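The quorum-queue setting described above is a single name-value pair in the origin's Additional Client Configuration properties; shown here in properties form for illustration:

```properties
# Additional Client Configuration entry that selects RabbitMQ quorum queues
x-queue-type=quorum
```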
- Processor enhancements
-
- SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
- Destination enhancements
-
- Google BigQuery (Legacy) destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
- MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
- Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include when writing data to Splunk.
- Credential stores
-
- New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
- CyberArk enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach the CyberArk services.
- Enterprise stage libraries
- In October 2021, StreamSets released the following new Enterprise stage
library:
- Connections when registered with Control Hub
-
- When Data Collector version 4.1.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- MongoDB stages
- RabbitMQ stages
- Redis stages
- Snowflake enhancement - The Snowflake connection includes the following role properties:
- Use Snowflake Role
- Snowflake Role Name
- Stage libraries
- This release includes the following new stage library:
- streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
- Additional enhancements
-
- Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.
4.1.0 Fixed Issues
- Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
- Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages don't release them appropriately.
- Data Collector does not properly handle Avro schemas with nested Union fields.
- Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
- When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
- The JDBC Lookup processor does not support expressions for table names when validating column mappings.
  Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
- The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
- When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.
4.1.x Known Issues
There are no important known issues at this time.
4.0.x Release Notes
- 4.0.2 - June 23, 2021
- 4.0.1 - June 7, 2021
- 4.0.0 - May 25, 2021
New Features and Enhancements
- Stage enhancements
-
- Control Hub orchestration stages - Orchestration stages use API credentials to connect to Control Hub in the StreamSets platform. This affects the following stages:
- Kafka stages - Kafka stages include an Override Stage
Configurations property that enables custom Kafka properties defined
in the stage to override other stage properties.
This can impact existing pipelines.
- MapR Streams stages - MapR Streams stages also include an Override
Stage Configurations property that enables the additional MapR or
Kafka properties defined in the stage to override other stage
properties.
This can impact existing pipelines.
- Salesforce stages - The Salesforce origin, processor, destination,
and the Tableau CRM destination include the following new timeout
properties:
- Connection Handshake Timeout
- Subscribe Timeout
- Oracle CDC Client origin:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
- Starting with version 4.0.1, the origin includes a Batch Wait Time property.
- Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
- HTTP Client processor:
- Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
- The following record header attributes are populated when you
use one of the Pass Records properties:
- httpClientError
- httpClientStatus
- httpClientLastAction
- httpClientTimeoutType
- httpClientRetries
- SQL Parser processor:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
- Connections when registered with Control Hub
- When Data Collector version 4.0.0 is registered with Control Hub cloud or with Control
Hub on-premises version 3.19.x or later, the following stages support
using Control Hub
connections:
- Oracle CDC Client origin
- SQL Server CDC Client origin
- SQL Server Change Tracking Client origin
- Enterprise stage libraries
- In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
- Additional features
-
- SDC_EXTERNAL_RESOURCES environment variable - An optional root
directory for external resources, such as custom stage libraries,
external libraries, and runtime resources.
The default location is $SDC_DIST/externalResources.
- Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
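For a manually started tarball installation, the variable can be overridden in the environment before starting Data Collector; the path below is a hypothetical example:

```shell
# Hypothetical override; when unset, SDC_EXTERNAL_RESOURCES evaluates to
# $SDC_DIST/externalResources.
export SDC_EXTERNAL_RESOURCES=/opt/sdc/external-resources
echo "$SDC_EXTERNAL_RESOURCES"
```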
- Deprecated features
- Several features and stages have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these features and stages. For a full list, click here.
Upgrade Impact
- Conflicting properties in Kafka and MapR Streams stages
- In previous releases, if you specify an additional configuration property that conflicts with a stage property setting in a Kafka or MapR Streams stage, the stage property takes precedence.
- Control Hub On-premises prerequisite task
- Before using Data Collector 4.0.0 or later versions with Control Hub On-premises, you must complete a prerequisite task. For details, see the StreamSets Support portal.
- HTTP Client processor batch wait time change
- With this release, the HTTP Client processor performs additional checks against the specified batch wait time. This can affect existing pipelines. For details, see Review HTTP Client Processor Pipelines.
- Open source status
- Data Collector 4.0.0 and later versions are not open source. This means that StreamSets will not make the source code publicly available.
- Stages removed
-
The following stages have been deprecated for several years and have been removed from Data Collector with this release:
- HTTP to Kafka origin
- SDC RPC to Kafka origin
- UDP to Kafka origin
- Updated environment variable default (tarball installation, manual start)
- For manually-started tarball installations, the default location for the SDC_RESOURCES environment variable has changed from $SDC_DIST/resources to $SDC_EXTERNAL_RESOURCES/resources, which evaluates to: $SDC_DIST/externalResources/resources.
4.0.2 Fixed Issues
- The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
- You cannot use API user credentials in Orchestration stages.
4.0.1 Fixed Issue
- In the JDBC Lookup
processor, enabling the Validate Column Mappings property when using an
expression to represent the lookup table generates an invalid SQL
query.
Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.
4.0.0 Fixed Issues
- The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
- The Hadoop FS destination fails to recover temporary files when the directory template includes pipeline parameters or expressions.
- The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed after exceeding the maximum transaction length.
- The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
- Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
- HTTP Client stages should not log the proxy password when the Data Collector logging mode is set to Debug.
- The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
- The MQTT Subscriber origin does not properly restore a persistent session.
- The Oracle CDC Client origin generates an exception when Oracle unexpectedly includes an empty string in a redo log statement. With this fix, the origin interprets empty strings as NULL.
- Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.
4.0.x Known Issues
There are no important known issues at this time.