Release Notes

5.2.x Release Notes

The Data Collector 5.2.0 release occurred on September 29, 2022.

New Features and Enhancements

New stages
  • MongoDB Atlas origin and destination - You can use the new MongoDB Atlas origin and destination to read from and write to MongoDB Atlas and MongoDB Enterprise Server.
Stage enhancements
  • Groovy stages - The Groovy Scripting origin and the Groovy Evaluator processor now support Groovy 4.0.
  • JDBC Multitable Consumer origin - The origin now provides the jdbc.primaryKeySpecification record header attribute for records from tables with a primary key and the jdbc.vendor record header attribute for all records.
  • JDBC Tee processor and JDBC Producer destination - These stages can manage primary key value updates using the jdbc.primaryKey.before.columnName record header attribute for the old value and the jdbc.primaryKey.after.columnName record header attribute for the new value.
  • MQTT stages - The MQTT Subscriber origin and the MQTT Publisher destination now support entering a list of brokers from high availability MQTT clusters without a load balancer.
  • Oracle CDC Client origin:
    • Conditional Blob and Clob support - When the origin buffers changes locally, you can configure the origin to process Blob and Clob data using the following advanced properties:
      • Enable Blob and Clob Columns Processing property - Enable this property to process Blob and Clob columns.
      • Maximum LOB Size property - Optional property to define the maximum LOB size. When specified, overflow data is discarded.
    • LogMiner Query Timeout property - This property defines how long the origin waits for a LogMiner query to complete.
    • Time between Session Windows property - This advanced property sets the time to wait after a LogMiner session has been completely ingested. This ensures a minimum LogMiner window size.
    • Time after Session Window Start property - This advanced property sets the time to wait after a LogMiner session starts. This allows Oracle to finish setting up before processing begins.
  • Pipeline Finisher executor - The executor includes new React to Events and Event Type properties that enable the executor to stop a pipeline only upon receiving the specified event record type.

    For example, you can now configure the executor to stop the pipeline only after receiving a no-more-data event record, and to ignore all other records that it might receive. Previously, you might have used a precondition or a Filter processor to ensure that the executor received only no-more-data events.

  • Snowflake stages - Snowflake stages have been updated to support all Snowflake regions.
  • SQL Server CDC Client origin - The origin can be configured to combine the two update records that SQL Server creates for each update into a single record, which changes how the origin generates update records.

    With this property enabled, the origin generates record header attributes about the primary key.

  • SQL Server Change Tracking origin - The origin generates record header attributes about the primary key.
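As a sketch of how the primary key record header attributes described above might be consumed downstream, the following Python snippet compares the old and new values for each primary key column. The headers dictionary and helper function are illustrative only, not a Data Collector API.

```python
# Illustrative only: "headers" stands in for a record's header
# attributes; this is not a Data Collector API.
def primary_key_changed(headers, pk_columns):
    """Report whether any primary key column value changed in an update."""
    for col in pk_columns:
        before = headers.get("jdbc.primaryKey.before." + col)
        after = headers.get("jdbc.primaryKey.after." + col)
        if before != after:
            return True
    return False

headers = {
    "jdbc.primaryKey.before.ID": "100",
    "jdbc.primaryKey.after.ID": "200",
}
print(primary_key_changed(headers, ["ID"]))  # True
```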
Connections
With this release, the following stages support using Control Hub connections:
Stage libraries

This release includes the following new stage libraries:

  • streamsets-datacollector-groovy_4_0-lib

  • streamsets-datacollector-mongodb_atlas-lib

Upgrade Impact

Review MySQL Binary Log pipelines

With 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.

In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields are converted to Long, and Set fields are converted to Integer.

In 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.

After you upgrade to 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
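The version- and setting-dependent type mapping described above can be summarized in a short sketch. This is illustrative pseudologic, not Data Collector code:

```python
# Illustrative summary of the Enum/Set type mapping described above;
# not Data Collector code.
def mapped_type(column_kind, binlog_row_metadata, version):
    """Return the Data Collector field type for a MySQL Enum or Set column."""
    if version >= (5, 2, 0) or binlog_row_metadata == "FULL":
        return "String"
    # Before 5.2.0, with binlog_row_metadata set to MINIMAL:
    return {"Enum": "Long", "Set": "Integer"}[column_kind]

print(mapped_type("Enum", "MINIMAL", (5, 1, 0)))  # Long
print(mapped_type("Set", "MINIMAL", (5, 2, 0)))   # String
```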
Review Oracle CDC Client pipelines

With 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.

In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values in records.

Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values in records. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.
Review Oracle CDC Client pipelines to assess how they should handle Blob and Clob fields:
  • To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size.

    Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.

  • If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases.

    If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.

    In a future release, this behavior will change so the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.

5.2.0 Fixed Issues

  • The MySQL Binary Log origin converts Enum and Set fields to different field types based on how the binlog_row_metadata database property is set. This fix has upgrade impact.
  • The Include Deleted Records property in the Salesforce Lookup processor does not display.

  • The Salesforce Bulk API destination can encounter problems when generating error records.

  • Pipeline parameters do not work properly with required list properties.

5.2.x Known Issues

  • Data Collector cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.
    Workaround: Define runtime properties in the Data Collector configuration properties instead of in a separate runtime properties file.
  • The java.security.networkaddress.cache.ttl Data Collector configuration property does not cache Domain Name Service (DNS) lookups as expected.
  • When the Oracle CDC Client origin is not configured to process Blob or Clob columns, the origin includes Blob and Clob field names in the record with either null values or raw string values depending on whether the Unsupported Fields to Records property is enabled. This issue has upgrade impact.

5.1.x Release Notes

The Data Collector 5.1.0 release occurred in July 2022.

New Features and Enhancements

New stage
  • Pulsar Consumer origin - The new Pulsar Consumer origin can use multiple threads to read from Pulsar. The origin supports schema validation and Pulsar namespaces configured to enforce schema validation. You can specify the schema used to determine compatibility between the origin and a Pulsar topic. You can also use JWT authentication with the new origin.

    With the addition of this origin, the existing Pulsar Consumer origin has been renamed Pulsar Consumer (Legacy). Use the new Pulsar Consumer origin for all new development.

Stage enhancements
  • Aurora PostgreSQL CDC Client and PostgreSQL CDC Client origins - Both origins can now generate a record for each individual operation.

    Previously, the origins could only generate a record for each transaction.

  • Aurora PostgreSQL CDC Client, PostgreSQL CDC Client, and MySQL Binary Log origins - These origins include the following new record header attributes when a table includes a primary key:
    • jdbc.primaryKeySpecification - Includes a JSON-formatted string that lists the columns that form the primary key in the table and the metadata for those columns.
    • jdbc.primaryKey.before.<primary key column> - Includes the previous value for the specified primary key column.
    • jdbc.primaryKey.after.<primary key column> - Includes the new value for the specified primary key column.
  • Kafka Multitopic Consumer origin and Kafka Producer destination - The stages include a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
  • OPC UA Server origin - The origin now supports using a user name and password to authenticate with the OPC UA server, in addition to an anonymous log in.
  • Oracle CDC Client origin:
    • The origin now supports reading from Oracle 21c databases.
    • The field order of generated records now matches the column order in database tables. Previously, the field order was not guaranteed.
    • When you configure the origin to use local buffers and write to disk, you can specify an existing directory to use.
    • A new Data Collector configuration property affects the origin and can have upgrade impact. For details, see Upgrade Impact.
  • Pulsar Consumer (Legacy) origin:
    • The origin, formerly named Pulsar Consumer, has been renamed with this release.

      This change has no upgrade impact. However, we recommend using the new Pulsar Consumer origin, which supports multithreaded processing, to read from Pulsar.

    • You can specify the schema used to determine compatibility between the origin and a Pulsar topic.
    • You can also use JWT authentication with the origin.
  • Salesforce Bulk API 2.0 stages:
    • All Salesforce Bulk API 2.0 stages include a new Salesforce Query Timeout property which defines the number of seconds that the stage waits for a response to a query.
    • The Salesforce Bulk API 2.0 origin and Salesforce Bulk API 2.0 Lookup processor both include a new Maximum Query Columns property that limits the number of columns that can be returned by a query.
  • Scripting stages - The Groovy, JavaScript, and Jython Evaluator origins and processors now generate metrics for script execution and locking details that you can view when monitoring the pipeline.
  • Field Remover - The processor now includes the On Record Error property on the General tab.
  • Field Type Converter processor - When converting a Date, Datetime, or Time field, the Date Format property now offers explicit options to specify that the field contains a Unix timestamp in milliseconds or seconds.

    If the field contains a Unix timestamp and you select an alternate date format, then the behavior is unchanged: the processor assumes the timestamps are in milliseconds.

  • SQL Parser processor:
    • The processor adds the fields from the SQL statement in the same order as the corresponding columns in the database tables.
    • The processor now includes field attributes for columns converted to the Decimal or Datetime data types in Data Collector. The attributes provide additional information for each field.
  • Pulsar Producer destination - You can now use JWT authentication with the destination.
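The timestamp-unit distinction that the new Field Type Converter options make explicit matters because the same integer interpreted in the wrong unit lands decades away. A quick illustration in Python:

```python
from datetime import datetime, timezone

ts = 1664409600  # 2022-09-29 00:00:00 UTC, expressed in seconds

as_seconds = datetime.fromtimestamp(ts, tz=timezone.utc)
as_millis = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)  # misread as ms

print(as_seconds.year)  # 2022
print(as_millis.year)   # 1970
```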
Connection enhancements
  • Kafka connection - The connection includes a new Custom Authentication security option that enables specifying custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
  • OPC UA connection - The connection now supports using a user name and password to authenticate with the OPC UA server in addition to an anonymous log in to the server.
  • Pulsar connection - The connection can now use JWT authentication to connect to Pulsar.
  • Snowflake connection - The new Connection Properties property enables you to specify additional connection properties for Snowflake connections.
Additional enhancements
  • Data Collector Docker image - The Docker image for Data Collector 5.1.0, streamsets/datacollector:5.1.0, uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. This change can have upgrade impact.
  • Microsoft JDBC Driver for SQL Server - Data Collector uses version 10.2.1 of the driver to connect to Microsoft SQL Server. Due to changes in the driver, this can have upgrade impact.
  • Runtime parameters - You can use runtime parameters to represent a stage or pipeline property that displays as a list of configurations. For example, you can use a runtime parameter to define the Additional JDBC Configuration Properties for the JDBC Query Consumer origin.
  • Data Collector configuration properties - You can define the following new configuration properties:
    • stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL - When this configuration property is set to true, Data Collector attempts to disable SSL for all JDBC connections. This property is commented out by default.
    • stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize - When this configuration property is set to true, Data Collector reports memory consumption when the Oracle CDC Client origin uses local buffers.

      This property is set to false by default. In previous releases, the origin reported this information by default, so this enhancement has upgrade impact.

Stage libraries
This release includes the following new stage libraries:
  • streamsets-datacollector-apache-kafka_3_0-lib - For Apache Kafka 3.0.
  • streamsets-datacollector-apache-kafka_3_1-lib - For Apache Kafka 3.1.
  • streamsets-datacollector-apache-kafka_3_2-lib - For Apache Kafka 3.2.

Upgrade Impact

Review SQL Server pipelines without SSL/TLS encrypted connections
With 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version introduces a breaking change.
As a result, after you upgrade to Data Collector 5.1.0, upgraded pipelines that connect to Microsoft SQL Server without SSL/TLS encryption will likely fail with a message such as the following:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.
This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.
Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.
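For example, a pipeline-level fix might look like the following sketch, where the host and database names are placeholders:

```python
# Hypothetical example: the host and database names are placeholders.
base = "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=sales"

# Option 1: append encrypt=false directly to the connection string.
connection_string = base + ";encrypt=false"

# Option 2: supply encrypt as an additional JDBC property instead.
additional_jdbc_properties = {"encrypt": "false"}

print(connection_string)
```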
Review reporting requirements for Oracle CDC Client pipelines
With 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.
After you upgrade to 5.1.0, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.

This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short term troubleshooting.

Review Dockerfiles for custom Docker images
In previous releases, the Data Collector Docker image used Alpine Linux as a parent image. Due to limitations in Alpine Linux, with this release the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image.

If you build custom Data Collector images using earlier releases of streamsets/datacollector as the parent image, review your Dockerfiles and make all required updates to become compatible with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0.

5.1.0 Fixed Issues

  • When the Oracle CDC Client origin is configured to use the PEG Parser for processing, the jdbc.primaryKey.before.<primary key column> and the jdbc.primaryKey.after.<primary key column> record header attributes are not set correctly.
  • If an Oracle RAC node is force stopped, the Oracle CDC Client origin can stop producing records even though it is still mining through a LogMiner session. In this case, a new LogMiner session must be created instead of retrying until the system stabilizes. This issue is related to the internal tasks that Oracle runs after restarting a crashed node.
  • Data loss can occur when the Oracle CDC Client origin does not use local buffering and the pipeline is stopped while a transaction that contains multiple operations and spans several seconds is being processed.
  • Data Collector can fail to load a PostgreSQL driver correctly when you have pipelines that use different stages to access PostgreSQL.
  • The Maximum Parallel Requests property in the HTTP Client processor and destination does not work as expected.

    This property was removed from the processor and destination because these stages do not support parallel requests.

  • The MySQL Binary Log origin can fail when the order of columns in a source table changes.

    This issue is fixed when using the origin from Data Collector version 5.1.0 or later to read from MySQL 8.0 or later. However, you must set the binlog_row_metadata MySQL configuration property to FULL.

  • The MySQL Binary Log origin can stall and stop processing due to a problem with an internal queue. If you attempt to stop the pipeline at that time, the pipeline can become non-responsive.
  • The SFTP/FTP/FTPS connection does not include private key properties that enable configuring the connection to use private key authentication.

5.1.x Known Issues

There are no important known issues at this time.

5.0.x Release Notes

The Data Collector 5.0.0 release occurred on April 29, 2022.

New Features and Enhancements

New stages
Updated stages
  • HTTP Client enhancements - When using OAuth2 authentication with HTTP Client stages, you can configure the following new properties:
    • Use Custom Assertion and Assertion Key Type - Use these properties to specify a custom parameter for passing the JSON Web Token (JWT).
    • JWT Headers - Use to specify headers to include in the JWT.
  • JMS Producer destination - You can configure the destination to remove the jms.header prefix from record header attribute names before including the information as headers in the JMS messages.
  • Kafka Multitopic Consumer origin - The origin includes the following new properties:
    • Topic Subscription Type and Topic Pattern - Use these two properties to specify a regular expression that defines the topic names to read from, instead of simply listing the topic names.
    • Metadata Refresh Time - Specify the number of milliseconds to wait before checking for additional topics that match the regular expression.
  • Oracle CDC Client origin - When a table includes a primary key, the origin includes the following new record header attributes:
    • jdbc.primaryKeySpecification - Includes a JSON-formatted string with all primary keys in the table and related metadata. For example:
      jdbc.primaryKeySpecification = {"<primary key name>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true},
      "<primary key name 2>":{"type":2,"datatype":"VARCHAR","size":39,"precision":0,"scale":-127,"signed":true,"currency":true}}
    • jdbc.primaryKey.before.<primary key column> - Includes the previous value for the specified primary key column.
    • jdbc.primaryKey.after.<primary key column> - Includes the new value for the specified primary key column.
  • Pulsar Producer destination - Use the new Schema tab to specify the schema that Pulsar uses to validate the messages that the destination writes to a topic.
  • Salesforce stages - All Salesforce stages now use version 54.0 of the Salesforce API by default.
  • Start Jobs origin and processor - You can now configure the following job instance properties in the Start Jobs origin and processor:
    • Delete from Job Instances List when Completed
    • Attach Instances to Template
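Because jdbc.primaryKeySpecification is a JSON-formatted string, downstream code can parse it to recover the key columns and their metadata. A sketch with a placeholder column name, mirroring the structure of the example above:

```python
import json

# Placeholder column name; the structure mirrors the documented example.
spec = ('{"ORDER_ID":{"type":2,"datatype":"VARCHAR","size":39,'
        '"precision":0,"scale":-127,"signed":true,"currency":true}}')

primary_keys = json.loads(spec)
print(list(primary_keys))                    # ['ORDER_ID']
print(primary_keys["ORDER_ID"]["datatype"])  # VARCHAR
```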
Connections
  • With this release, the following stages support using Control Hub connections:
    • Aurora PostgreSQL CDC Client origin
    • Azure Synapse SQL destination

    • Google BigQuery (Enterprise) executor
    • Hive stages
    • OPC UA Client origin
    • Salesforce Bulk API 2.0 stages
Enterprise libraries
In May 2022, StreamSets released the following Enterprise stage libraries:
  • Azure Synapse 1.2.0
  • Databricks 1.6.0
  • Google 1.1.0
  • Oracle 1.4.0
  • Snowflake 1.11.0

For more information about these releases, see the Enterprise Libraries release notes.

Additional enhancements
  • Data Collector logs - Data Collector uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Data Collector used the Apache Log4j 1.x library which is now end-of-life.
  • Proxy server configuration - To configure Data Collector to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.

    Previously, you configured Data Collector to use a proxy server by defining Java configuration options for the deployment and then setting the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable on the Data Collector machine.

  • Elasticsearch 8.0 support - You can now use Elasticsearch stages to read from and write to Elasticsearch 8.0.
  • Credential stores property - A new credentialStores.usePortableGroups credential stores property enables migrating pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.
Stages removed

Google BigQuery (legacy) destination - With this release, the previously deprecated Google BigQuery (legacy) destination is no longer available for use.

To write to Google BigQuery, use the Google BigQuery (Enterprise) destination.

Upgrade Impact

Update Oracle CDC Client origin user accounts
With 5.0.0 and later, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.
Before you run pipelines that include the Oracle CDC Client origin, use the following GRANT statements to update the Oracle user account associated with the origin:
GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>; 
GRANT select on V_$INSTANCE to <user name>;

5.0.0 Fixed Issues

  • The Oracle CDC Client origin can fail if redo logs are rotated as the origin reads data from the current log. The origin can also fail when an Oracle RAC node fails or recovers from a failure or planned shut down.

    With this fix, the Oracle CDC Client origin can recover from additional recovery and maintenance scenarios, and in a more efficient fashion. However, the fix requires configuring additional permissions for the Oracle user. For more information, see Upgrade Impact.

  • The Oracle CDC Client origin treats the underscore character ( _ ) as a single-character wildcard in schema names and table name patterns, preventing its use as a literal underscore.

    With this fix, you can specify a literal underscore by escaping it with a slash character ( / ). For example, to specify the NA_SALES table, enter NA/_SALES.

  • Oracle CDC Client origin pipelines fail with null pointer exceptions when the origin is configured to buffer data locally to disk, instead of in memory.
  • When the Oracle CDC Client origin Convert Timestamp to String advanced property is enabled, the origin does not properly handle unparsable timestamps.
  • JDBC stages that read data, such as the JDBC Query Consumer origin or the JDBC Lookup processor, do not generate records after one of the JDBC stages encounters an error reading a table column.
  • The JDBC Query Consumer origin incorrectly generates a no-more-data event when the limit in a query matches the configured max batch size.
  • The JDBC Query Consumer origin is unable to read Oracle data of the Timestamp with Local Time Zone data type.
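The underscore-escaping rule described in the fix above can be mimicked with a small pattern translator. This is an illustration of the documented rule, not the origin's actual parser:

```python
# Illustration only: "/_" is a literal underscore, a bare "_" is a
# single-character wildcard, and "%" matches any run of characters.
import re

def table_pattern_to_regex(pattern):
    out = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "/" and i + 1 < len(pattern) and pattern[i + 1] == "_":
            out.append("_")   # escaped: literal underscore
            i += 2
        elif pattern[i] == "_":
            out.append(".")   # wildcard: any single character
            i += 1
        elif pattern[i] == "%":
            out.append(".*")  # wildcard: any run of characters
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

print(bool(re.match(table_pattern_to_regex("NA/_SALES"), "NA_SALES")))  # True
print(bool(re.match(table_pattern_to_regex("NA/_SALES"), "NAXSALES")))  # False
print(bool(re.match(table_pattern_to_regex("NA_SALES"), "NAXSALES")))   # True
```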

5.0.x Known Issues

There are no important known issues at this time.

4.4.x Release Notes

The Data Collector 4.4.x releases occurred on the following dates:
  • 4.4.1 on March 24, 2022
  • 4.4.0 on February 16, 2022

New Features and Enhancements

Updated stages
  • Amazon S3 stages - You can use an Amazon S3 stage to connect to Amazon S3 using a custom endpoint.
  • Amazon S3 destination - You can configure the destination to add tags to the Amazon S3 objects that it creates.
  • Base 64 Field Decoder and Encoder processors - You can configure the processors to decode or encode multiple fields.
  • Google BigQuery (Legacy) destination - The destination, formerly called Google BigQuery, has been renamed and deprecated with this release, and may be removed in a future release. To write data to Google BigQuery, we recommend the Google BigQuery (Enterprise) destination, which supports processing CDC data and handling data drift.
  • Hive Query executor - You can use time functions in the SQL queries that execute on Hive or Impala. When using time functions, you can also select the time zone that the executor uses to evaluate the functions.
  • HTTP Client stages - You can configure additional security headers to include in the HTTP requests made by the stage. Use additional security headers when you want to include sensitive information, such as user names or passwords, in an HTTP header.

    For example, you might use the credential:get() function in an additional security header to retrieve a password stored securely in a credential store.

  • HTTP Client processor - You can configure the processor to send a single request that contains all records in the batch.
  • JMS Producer destination - You can configure the destination to include record header attributes with a jms.header prefix as JMS message headers.
  • Pulsar stages - You can configure a Pulsar stage to use OAuth 2.0 authentication to connect to an Apache Pulsar cluster.
  • Pulsar Consumer (Legacy) origin - The origin creates a pulsar.topic record header attribute that includes the topic that the message was read from.
  • Salesforce stages - Salesforce stages now use version 53.1.0 of the Salesforce API by default.
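For example, an additional security header might pull its value from a credential store with the credential:get() function. The store ID, group, and secret name below are placeholders:

```
Header name:  X-API-Password
Header value: ${credential:get("jks", "all", "api_password")}
```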
Connections
  • With this release, the following stages support using Control Hub connections:
    • CoAP Client destination
    • Influx DB destination
    • Influx DB 2.x destination
    • Pulsar stages
  • Amazon S3 enhancement - The Amazon S3 connection supports connecting to Amazon S3 using a custom endpoint.
Credential stores
  • Google Secret Manager - You can configure Data Collector to authenticate with Google Secret Manager using credentials in a Google Cloud service account credentials JSON file.
Enterprise Library

In February 2022, StreamSets released an updated Snowflake Enterprise stage library.

For more information about the Snowflake release, see the Snowflake 1.10.0 release notes, available with the Enterprise Libraries Release Notes.

Enterprise stage libraries are free for use in both development and production.

Upgrade Impact

Encryption JAR file removed from Couchbase stage library
With Data Collector 4.4.0 and later, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:

https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar

To install an external library, see Install External Libraries.

4.4.1 Fixed Issues

  • In Data Collector version 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
  • When a Kubernetes pod that contains Data Collector shuts down while a pipeline that includes a MapR FS File Metadata or HDFS File Metadata executor is running, the executor cannot always perform the configured tasks.
  • Access to Control Hub through the Data Collector user interface times out.

    Though this fix may have resolved the issue, as a best practice, use Control Hub to author pipelines instead of Data Collector.

4.4.0 Fixed Issues

  • To address recently discovered vulnerabilities in Apache Log4j 2.17.0 and earlier 2.x versions, Data Collector 4.4.0 is packaged with Log4j 2.17.1. This is the latest available Log4j version, and contains fixes for all known issues.
  • The Oracle CDC Client origin does not correctly handle a daylight saving time change when configured to use a database time zone that uses daylight saving time.
  • The MapR DB CDC origin does not properly handle records with null values.
  • The Kafka Multitopic Consumer origin does not respect the configured Max Batch Wait Time.
  • A state notification webhook always uses the POST request method, even if configured to use a different request method.
  • When the HTTP Client origin uses OAuth authentication and the request returns 401 Unauthorized and 403 Forbidden statuses, the origin generates a new OAuth token indefinitely.
  • The MapR DB CDC origin incorrectly updates the offset during pipeline preview.
  • When Amazon stages are configured to assume another role and configured to connect to an endpoint, the stages do not redirect to the correct URL.
  • JDBC origins encounter an exception when reading data with an incorrect date format, instead of processing the record as an error record.
  • The Directory origin skips reading files that have the same timestamp.
  • The JDBC Multitable Consumer origin cannot use a wildcard character (%) in the Schema property.
  • The Azure Data Lake Storage Gen2 and Local FS destinations do not correctly shut down threads.
  • When using WebSocket tunneling for browser to engine communication, Data Collector cannot use a proxy server for outbound requests made to Control Hub.
  • You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.

4.4.x Known Issue

  • In Data Collector 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.

    Workaround: If using Data Collector 4.4.0, upgrade to Data Collector 4.4.1, where this issue is fixed.

4.3.x Release Notes

The Data Collector 4.3.0 release occurred on January 13, 2022.

New Features and Enhancements

Internal update
This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
Note: All new Data Collector deployments on StreamSets DataOps Platform will use Data Collector version 4.3.0 or higher. Existing deployments are not affected.

4.3.0 Fixed Issues

  • To address recently discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.3.0 is packaged with Log4j 2.17.0. This is the latest available Log4j version and contains fixes for all known issues.
  • Data Collector now sets a Java system property to help address the Apache Log4j known issues.
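The release note above does not name the property that Data Collector sets. As an illustrative sketch only: log4j2.formatMsgNoLookups was the commonly used Log4j 2 mitigation at the time, and SDC_JAVA_OPTS is the usual way to pass extra JVM options to Data Collector, so an equivalent manual configuration might look like this (both names are assumptions about this release, not confirmed by it):

```shell
# Illustrative only: pass a Log4j mitigation flag to the Data Collector JVM
# via SDC_JAVA_OPTS. log4j2.formatMsgNoLookups disables message lookups,
# the JNDI attack vector behind the Log4j vulnerabilities.
export SDC_JAVA_OPTS="${SDC_JAVA_OPTS} -Dlog4j2.formatMsgNoLookups=true"
```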

4.3.x Known Issues

  • You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.

4.2.x Release Notes

The Data Collector 4.2.x releases occurred on the following dates:
  • 4.2.1 on December 23, 2021
  • 4.2.0 on November 9, 2021

New Features and Enhancements

New support
New stage
Updated stages
  • Couchbase Lookup processor property name updates - For clarity, the following property names have been changed:
    • Property Name is now Sub-Document Path.
    • Return Properties is now Return Sub-Documents.
    • SDC Field is now Output Field.
    • When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
    • When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
  • Einstein Analytics destination enhancements:
  • HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
  • PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
  • Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
  • Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
  • SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and SOCKS proxies.
Connections
  • With this release, the following stage supports using Control Hub connections:
    • Cassandra destination
  • SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
Additional enhancements
  • Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
  • Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user’s group against a comma-separated list of groups allowed to access each secret.
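As a sketch of how this might look, assuming a credential store configured with the ID gcp (the store ID and values below are hypothetical; only the enforceEntryGroup property name comes from the note above):

```properties
# credential-stores.properties (illustrative)
credentialStore.gcp.config.enforceEntryGroup=true
```

A stage property would then reference a secret with the credential EL function, for example ${credential:get("gcp", "devops", "db-password")}, and with enforceEntryGroup enabled the lookup succeeds only if the group passed in the function is in the list of groups allowed to access that secret.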

Upgrade Impact

Enabling HTTPS for Data Collector
With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties.
In previous releases, you can store the keystore file in the Data Collector configuration directory, <installation_dir>/etc, and then define the location to the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.
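As an illustrative sketch, a keystore file moved to the resources directory could then be referenced with a relative path in the Data Collector configuration properties (the property name below follows the standard HTTPS settings and should be treated as an assumption):

```properties
# sdc.properties (illustrative)
# keystore.jks resides in <installation_dir>/externalResources/resources
https.keystore.path=keystore.jks
```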
Tableau CRM destination write behavior change
The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.

With this release, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval during which no new data arrives. You configure the interval with the Dataset Wait Time stage property.

In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.

After upgrading from version 3.7.0 - 4.1.x, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to the interval that Salesforce should wait before processing each dataset.

When upgrading from a version prior to 3.7.0, no action is required. Versions prior to 3.7.0 behave like this release.
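The new write behavior is essentially a quiet-period rule: batches that arrive within Dataset Wait Time of each other join the same dataset, and Salesforce is signaled to process only after a gap. A minimal model of that rule (a hypothetical helper, not the actual implementation):

```python
def count_process_signals(arrival_times, dataset_wait_time):
    """Model the upgraded behavior: batches arriving within
    dataset_wait_time of the previous batch join the same dataset,
    so one 'process' signal is sent per quiet gap."""
    if not arrival_times:
        return 0
    signals = 1
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        if cur - prev > dataset_wait_time:
            signals += 1  # quiet period elapsed; previous dataset was processed
    return signals

# Four batches with a wait time of 5: the 3.7.0 - 4.1.x behavior signaled
# four times (once per batch); the new behavior signals twice, because only
# the gap between t=2 and t=10 exceeds the wait time.
print(count_process_signals([0, 1, 2, 10], 5))
```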

4.2.1 Fixed Issues

  • To address recently discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version and contains fixes for all known issues.
  • Data Collector now sets a Java system property to help address the Apache Log4j known issues.
  • The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the 4.1.x level.

4.2.0 Fixed Issues

  • Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
  • The Oracle CDC Client origin can miss recovering transactional data when the pipeline unexpectedly stops when the origin is processing overlapping transactions.
  • The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
  • The MongoDB destination cannot write null values to MongoDB.
  • The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
  • Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
  • The MapR DB CDC origin does not properly handle records with deleted fields.
  • When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
  • The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.

4.2.x Known Issues

  • You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.

4.1.x Release Notes

The Data Collector 4.1.0 release occurred on August 18, 2021.

New Features and Enhancements

New stage
  • Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
Stage type enhancements
  • Amazon stages - When you configure the Region property, you can select from several additional regions.
  • Kudu stages - The default value for the Maximum Number of Worker Threads property is now 2. Previously, the default was 0, which used the Kudu default.

    Existing pipelines are not affected by this change.

  • Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
  • Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
  • Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
Origin enhancements
  • Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
  • MySQL Binary Log origin - The origin now recovers automatically from the following issues:
    • Lost, damaged, or unestablished connections.
    • Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
  • Oracle CDC Client origin:
    • The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
    • The origin provides additional LogMiner metrics when you monitor a pipeline.
  • RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
Processor enhancements
  • SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
Destination enhancements
  • Google BigQuery (Legacy) destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
  • MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
  • Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include with data written to Splunk.
Credential stores
  • New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
  • CyberArk enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach CyberArk services.
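For example (only the property name comes from the note above; the proxy URI value is hypothetical):

```properties
# credential-stores.properties (illustrative value)
credentialStore.cyberark.config.ws.proxyURI=http://proxy.example.com:3128
```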
Enterprise Stage Libraries
In October 2021, StreamSets released the following new Enterprise stage library:
  • Google
In September 2021, StreamSets released updates for the following Enterprise stage libraries:
  • Azure Synapse
  • Databricks
  • Oracle
  • Snowflake

For more information about the new features, fixed issues, and known issues in an Enterprise stage library, see the Enterprise stage library release notes. For a list of available Enterprise libraries, see Enterprise Stage Libraries.

Connections
  • With this release, the following stages support using Control Hub connections:
    • MongoDB stages
    • RabbitMQ stages
    • Redis stages
  • Salesforce enhancement - The Salesforce connection includes the following role properties:
    • Use Snowflake Role
    • Snowflake Role Name
Stage libraries
This release includes the following new stage library:
  • streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
Additional enhancements
  • Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.

4.1.0 Fixed Issues

  • Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
  • Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages do not release them appropriately.
  • Data Collector does not properly handle Avro schemas with nested Union fields.
  • Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
  • When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
  • The JDBC Lookup processor does not support expressions for table names when validating column mappings.
    Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
  • The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
  • When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.

4.1.x Known Issues

  • You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.

4.0.x Release Notes

The Data Collector 4.0.x releases occurred on the following dates:
  • 4.0.2 - June 23, 2021
  • 4.0.1 - June 7, 2021
  • 4.0.0 - May 25, 2021

New Features and Enhancements

Stage enhancements
  • Control Hub orchestration stages - Orchestration stages use API credentials to connect to DataOps Platform Control Hub. This affects the following stages:
  • Kafka stages - Kafka stages include an Override Stage Configurations property that enables custom Kafka properties defined in the stage to override other stage properties.
  • MapR Streams stages - MapR Streams stages also include an Override Stage Configurations property that enables the additional MapR or Kafka properties defined in the stage to override other stage properties.
  • Salesforce stages - The Salesforce origin, processor, destination, and the Tableau CRM destination include the following new timeout properties:
    • Connection Handshake Timeout
    • Subscribe Timeout
  • Oracle CDC Client origin:
    • You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
    • The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
    • Starting with version 4.0.1, the origin includes a Batch Wait Time property.
  • Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
  • HTTP Client processor:
    • Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
    • The following record header attributes are populated when you use one of the Pass Records properties:
      • httpClientError
      • httpClientStatus
      • httpClientLastAction
      • httpClientTimeoutType
      • httpClientRetries
  • SQL Parser processor:
    • You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
    • The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
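Downstream logic can branch on the HTTP Client header attributes listed above. A minimal sketch, operating on a plain dict of header attributes (the attribute names come from the list; the helper and routing rules are hypothetical):

```python
def http_client_disposition(header_attrs):
    """Classify a record passed through by one of the Pass Records
    properties, using the HTTP Client header attributes
    (hypothetical routing logic, not part of the product)."""
    if header_attrs.get("httpClientTimeoutType"):
        return "timed-out"
    if header_attrs.get("httpClientError"):
        return "failed: " + header_attrs.get("httpClientStatus", "unknown")
    return "ok"

print(http_client_disposition({"httpClientError": "Request failed",
                               "httpClientStatus": "503",
                               "httpClientRetries": "3"}))
```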
Connections
With this release, the following stages support using Control Hub connections:
  • Oracle CDC Client origin
  • SQL Server CDC Client origin
  • SQL Server Change Tracking Client origin
Enterprise stage libraries
In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
For more information about the new features, fixed issues, and known issues in those releases, see their release notes on the StreamSets Release Notes page.
For a list of available Enterprise libraries, see Enterprise Stage Libraries.
Additional features
  • SDC_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as custom stage libraries, external libraries, and runtime resources.

    The default location is $SDC_DIST/externalResources.

  • Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.

4.0.2 Fixed Issues

  • The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
  • You cannot use API user credentials in orchestration stages.

4.0.1 Fixed Issue

  • In the JDBC Lookup processor, enabling the Validate Column Mappings property when using an expression to represent the lookup table generates an invalid SQL query.

    Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.

4.0.0 Fixed Issues

  • The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
  • The Hadoop FS destination stages fail to recover temporary files when the directory template includes pipeline parameters or expressions.
  • The Oracle CDC Client origin can generate an exception when trying to process data from a transaction that was already partially processed and flushed after exceeding the maximum transaction length.
  • The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
  • Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
  • HTTP Client stages log the proxy password when the Data Collector logging mode is set to Debug.
  • The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
  • The MQTT Subscriber origin does not properly restore a persistent session.
  • The Oracle CDC Client origin generates an exception when Oracle unexpectedly includes an empty string in a redo log statement. With this fix, the origin interprets empty strings as NULL.
  • Data Collector uses the Java version specified in the PATH environment variable instead of the version defined in the JAVA_HOME environment variable.

4.0.x Known Issues

There are no important known issues at this time.