Post Upgrade Tasks

Update Control Hub On-Premises

By default, StreamSets Control Hub on-premises can work with registered Data Collectors from version 2.1.0.0 to the current version of Control Hub. If you use Control Hub on-premises and you upgrade registered Data Collectors to a version higher than your current version of Control Hub, you might need to modify the Data Collector version range within your Control Hub installation.

For example, if you use Control Hub on-premises version 3.8.0 and you upgrade registered Data Collectors to version 5.10.0, you must update the maximum Data Collector version that can work with Control Hub. As a best practice, configure the maximum Data Collector version to 5.99.999 to ensure that Data Collector upgrades to later minor versions, such as 5.11.0 or 5.12.0, will continue to work with Control Hub.

Note: If you register Data Collector version 3.19.x or later with Control Hub on-premises version 3.18.x or earlier, then some stages in the Control Hub Pipeline Designer display a Connection property that is not supported. Do not change the property from the default value of None. If you select Choose Value or use a parameter to define the property, Pipeline Designer hides the remaining connection properties and the pipeline fails to run.

To modify the Data Collector version range:

  1. Log in to Control Hub as the default system administrator - the admin@admin user account.
  2. In the Navigation panel, click Administration > Data Collectors.
  3. Click the Component Version Range icon.
  4. Enter the maximum Data Collector version that can work with Control Hub, such as 5.99.999.

Update Pipelines using Legacy Stage Libraries

When you upgrade, review the complete list of legacy stage libraries. If your upgraded pipelines use these legacy stage libraries, the pipelines will not run until you perform one of the following tasks:
Use a current stage library
We strongly recommend that you upgrade your system and use a current stage library in the pipeline:
  1. Upgrade the system to a more current version.
  2. Install the stage library for the upgraded system.
  3. In the pipeline, edit the stage and select the appropriate stage library.
Install the legacy stage library
Though not recommended, you can install the older stage libraries. For more information, see Legacy Stage Libraries.

Review the Oracle CDC Client Record Cache Size

Starting with version 5.10.0, you can configure the maximum size of the record cache for an Oracle CDC Client origin using the Records Cache Size property. Upgraded pipelines are given the default value of -2, which represents two times the batch size.

After upgrading to Data Collector 5.10.0 or later, verify that Oracle CDC Client origins are configured with the appropriate record cache size.

Review Search Mode Behavior for Start Jobs Pipelines

Starting with version 5.10.0, Start Jobs stages have updated search mode options. Pipelines upgraded from version 5.2.x or earlier that were configured with the contains search mode option are updated to use the new contains unique search mode option.

After upgrading to Data Collector 5.10.0 or later from Data Collector 5.2.x or earlier, verify that Start Jobs pipelines are using the appropriate search mode.

Review the Maximum Batch Vault Size for Oracle CDC Origin Pipelines

Starting with version 5.9.0, the Oracle CDC origin has a Max Batch Vault Size property that allows you to configure the maximum number of batches the origin pre-generates while the pipeline is processing other batches. In upgraded pipelines, the origin uses the default maximum batch vault size of 64.

After you upgrade to version 5.9.0 or later, review Oracle CDC origin pipelines. If the maximum batch vault size is not appropriate, update the pipelines accordingly.

Review Amazon, Azure, Data Parser, JMS Consumer, and Pulsar Consumer Origin Pipelines

Starting with version 5.9.0, the following origins no longer read tables that contain multiple columns with the same name:
  • Amazon S3

  • Amazon SQS Consumer

  • Azure Blob Storage

  • Azure Data Lake Storage Gen2

  • Data Parser

  • JMS Consumer

  • Pulsar Consumer

  • Pulsar Consumer (Legacy)

When configured to read tables that contain duplicate column names, the origin treats the tables as invalid and generates an error.

After you upgrade to version 5.9.0 or later, review pipelines that use these origins. If any pipelines require the ability to read tables containing multiple columns with the same name, configure the origins to ignore column headers.

Review Oracle Bulkload Origin Pipelines

Starting with version 5.8.0, pipelines using the Oracle Bulkload origin no longer fail when the origin encounters an empty table. This change might cause Oracle Bulkload pipelines created with earlier versions of Data Collector to behave in unexpected ways.

After you upgrade to version 5.8.0 or later, review any pipelines that use the Oracle Bulkload origin to ensure they behave as expected.

Update stages that were using Enterprise stage libraries

Starting with version 5.8.0, Data Collector no longer supports Enterprise stage libraries.

After you upgrade to 5.8.0 or later, update stages using any of the following Enterprise stage libraries by installing the stage library as a custom stage library:
  • GPSS
  • MemSQL
  • Protector
  • Microsoft SQL Server 2019 Big Data Cluster
  • Teradata

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.7.0, the Oracle CDC origin must use a user account with access to the all_tab_cols view.

After you upgrade to version 5.7.0 or later, run the following command in Oracle to grant the user account access to the view:
grant select on all_tab_cols to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.
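
For example, assuming a pluggable database named ORCLPDB1 and a user named sdc_user (both placeholders for your environment), the sequence for a CDB database looks like this:
-- run from the root container
ALTER SESSION SET CONTAINER = cdb$root;
grant select on all_tab_cols to sdc_user;
-- then run from the pluggable database
ALTER SESSION SET CONTAINER = ORCLPDB1;
grant select on all_tab_cols to sdc_user;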

Review Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 Origin Pipelines

Starting with version 5.7.0, Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins have a File Processing Delay property that allows you to configure the minimum number of milliseconds that must pass from the time a file is created before it is processed. In upgraded pipelines these origins receive the default file processing delay of 10,000 milliseconds.

After you upgrade to version 5.7.0 or later, review pipelines that include the Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins. If the 10,000 millisecond delay is not appropriate, update the pipelines accordingly.

Review Amazon S3 and Databricks Delta Lake Stages

Starting with version 5.6.0, you can no longer include the forward slash (/) in the following properties due to an Amazon Web Services (AWS) SDK upgrade:
  • Bucket property for the Amazon S3 origin
  • Bucket and path property for the Amazon S3 destination and executor
  • Bucket property for the Databricks Delta Lake destination when staging files to Amazon S3

For more information about this change, see the aws-sdk-java list of Amazon S3 bug fixes.

As a result, you can define only the bucket name in these bucket properties. Use the following properties for each stage to define the path to an object inside the bucket:
  • Amazon S3 origin - Common Prefix and Prefix Pattern properties
  • Amazon S3 destination - Common Prefix and Partition Prefix properties
  • Amazon S3 executor - Object property on the Tasks tab
  • Databricks Delta Lake destination - Stage File Prefix property on the Staging tab

After you upgrade to version 5.6.0 or later, review the bucket property in these stages to ensure that the property defines the bucket name only. Modify the properties as needed to define only the bucket name in the bucket property and to define the path in the remaining properties.

For example, if an Amazon S3 origin configured in an earlier Data Collector version defines the properties as follows:
  • Bucket: orders/US/West
  • Common Prefix:
  • Prefix Pattern: **/*.log
Update the properties as follows:
  • Bucket: orders
  • Common Prefix: US/West/
  • Prefix Pattern: **/*.log

Install the Databricks Stage Library

Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the Databricks stage library, streamsets-datacollector-sdc-databricks-lib, to enable pipelines and jobs that use these Databricks stages or the Databricks connection to run as expected.

Review Databricks Stages

Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.

After you upgrade to version 5.6.0 or later, review the JDBC URL property in the Databricks Delta Lake destination and the JDBC Connection String property in the Databricks Query executor to ensure that the scheme resolves to jdbc:databricks.
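
For example, a connection string that begins as follows, with the host shown as a placeholder and any additional driver parameters omitted:
jdbc:spark://<server-hostname>:443/default
should now begin with the jdbc:databricks scheme:
jdbc:databricks://<server-hostname>:443/default
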
Note: The upgrade process does not update runtime parameters. You must manually change runtime parameters that define the URL or connection string.

Update the Databricks Delta Lake Connection

Starting with version 5.6.0, the scheme of the JDBC URL for the Databricks Delta Lake connection is jdbc:databricks rather than jdbc:spark.

After you update a connection to use a version 5.6.0 or later authoring Data Collector, edit the JDBC URL property to use the jdbc:databricks scheme.

Review Scripts in Jython Stages

Starting with version 5.6.0, Jython stages use Jython 2.7.3 to process data.

After you upgrade to version 5.6.0 or later, review the scripts used in the Jython Scripting origin and the Jython Evaluator processor to ensure that they process data as expected.

Install the JDBC Oracle Stage Library

Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the JDBC Oracle stage library to enable pipelines and jobs that use the Oracle Bulkload origin to run as expected.

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.6.0, the Oracle CDC origin requires that the configured database user has access to the v$containers view.

After you upgrade to version 5.6.0 or later, run the following command in Oracle to grant the user account access to the view:
grant select on v$containers to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.

Update Origins and Processors that Read Compressed Files

Starting with version 5.6.0, origins that read compressed files require you to set the Compression Library property to properly read files compressed with the Airlift version of Snappy. Destinations compress files with the Airlift version of Snappy. This affects the HTTP Client processor and the following origins:
  • Amazon S3
  • Azure Blob Storage
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2 (Legacy)
  • Azure IoT/Event Hub Consumer
  • CoAP Server
  • Directory
  • File Tail
  • Hadoop FS Standalone
  • Google Cloud Storage
  • Google Pub/Sub Subscriber
  • gRPC Client
  • HTTP Client
  • HTTP Server
  • Kafka Multitopic Consumer
  • MQTT Subscriber
  • REST Service
  • SFTP/FTP/FTPS Client
  • TCP Server
  • WebSocket Client
  • WebSocket Server

After you upgrade to version 5.6.0 or later, review your pipelines. In any origins and processors that read files compressed using the Airlift version of Snappy, including files produced by destinations, set the Compression Library property to Snappy (Airlift Snappy).

Install the Azure stage library

Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.

After you upgrade to version 5.5.0 or later, install the Azure stage library, streamsets-datacollector-azure-lib, so that pipelines and jobs that use the Azure Synapse SQL destination or connection run as expected.

Review Salesforce pipelines

Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.

After you upgrade to version 5.5.0 or later, review pipelines with Salesforce stages and ensure that they do not expect dates to be imported as strings.

Review OPC UA Client Pipelines

Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.

After you upgrade to version 5.5.0 or later, review OPC UA Client pipelines to ensure that the configuration for the Max Message Size property is appropriate for the pipeline. The default maximum message size is 2097152.

Install the Snowflake Stage Library to Use Snowflake

Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.

After you upgrade to 5.4.0 or later, install the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib, to enable pipelines and jobs that use Snowflake stages or connections to run as expected.

Install the Google Cloud Stage Library to Use BigQuery

Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.

After you upgrade to version 5.3.0 or later, install the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, to enable pipelines and jobs using BigQuery stages or connections to run as expected.

Review JDBC Multitable Consumer Pipelines

Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.

Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property.

After you upgrade to version 5.3.0 or later, review JDBC Multitable Consumer origin pipelines to ensure that the new value for the Minimum Idle Connections property is appropriate for each pipeline.

Review Missing Field Behavior for Field Replacer Processors

Starting with version 5.3.0, the advanced Field Does Not Exist property in the Field Replacer processor has the following two new options that replace the Include without Processing option:
  • Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
  • Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.

After you upgrade to version 5.3.0 or later, the Field Does Not Exist property is set to Add New Field. Review Field Replacer pipelines to ensure that this behavior is appropriate.

Review runtime:loadResource Pipelines

Starting with version 5.3.0, pipelines that include the runtime:loadResource function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.

After you upgrade to version 5.3.0 or later, review pipelines that use the runtime:loadResource function and ensure that the function calls resource files that include the required information.

Manage Underscores in Snowflake Connection Information

Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens by default. This can adversely affect communication with Snowflake when the connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.

After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.

When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
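
For example, if you define additional connection properties for the Snowflake JDBC driver in a Snowflake stage or connection, the entry would be similar to the following (where you set driver properties depends on the stage or connection configuration):
allowUnderscoresInHost=true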

Review MySQL Binary Log Pipelines

Starting with version 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.

In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields are converted to Long, and Set fields are converted to Integer.

In version 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.

After you upgrade to version 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
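
For reference, you can check the current setting in MySQL, and change it if FULL metadata is appropriate for your environment (the variable is available in MySQL 8.0 and later and requires sufficient privileges to change):
SHOW VARIABLES LIKE 'binlog_row_metadata';
SET GLOBAL binlog_row_metadata = 'FULL';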

Review Blob and Clob Processing in Oracle CDC Client Pipelines

Starting with version 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.

In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values.

Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.

Review Oracle CDC Client pipelines to assess how they should handle Blob and Clob columns:
  • To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size.

    Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.

  • If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases.

    If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.

    In a future release, this behavior will change so the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.

Review Error Handling for Snowflake CDC Pipelines

In previous releases of the Snowflake Enterprise stage library, when the Snowflake destination runs a MERGE query that fails to write all CDC data in a batch to Snowflake, the Snowflake destination generates a stage error indicating that there was a difference between the number of records expected to be written and the number of records actually written to Snowflake.

The destination does not provide additional detail because Snowflake does not provide information about the individual records that failed to be written when a query fails.

Starting with version 1.12.0 of the Snowflake Enterprise stage library, when a query that writes CDC data fails, in addition to generating the stage error, the Snowflake destination passes all records in the batch to error handling. As a result, the error records are handled based on the error handling configured for the stage and pipeline.

Review stage and pipeline error handling for Snowflake CDC pipelines to ensure that error records are handled appropriately.

Note: The error records passed to error handling have been processed by the Snowflake destination. For example, if the batch includes three records that update the same row, they are merged into a single update record.

Review SQL Server Pipelines with Unencrypted Connections

Starting with version 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version introduced a backward-incompatible change.

As a result, after you upgrade to 5.1.0 or later, upgraded pipelines that connect to Microsoft SQL Server without SSL/TLS encryption will likely fail with a message such as the following:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.

This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.

Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
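
For example, a connection string that disables encryption might look like the following, with the host and database shown as placeholders:
jdbc:sqlserver://<host>:1433;databaseName=<database>;encrypt=false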

To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.
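
For example, to set the property, add the following line to the Data Collector configuration file, sdc.properties, and restart Data Collector to apply the change:
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL=true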

Review Dockerfiles for Custom Docker Images

Starting with version 5.1.0, the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. In previous releases, the Data Collector Docker image used Alpine Linux as a parent image.

If you build custom Data Collector images using streamsets/datacollector version 5.0.0 or earlier as the parent image, review your Dockerfiles and make all required updates to become compatible with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0 or later versions.

Review Oracle CDC Client Local Buffer Pipelines

Starting with version 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.

After you upgrade to Data Collector 5.1.0 or later, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.

This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short term troubleshooting.
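
For example, to enable the reporting temporarily for troubleshooting, add the following line to the Data Collector configuration file, sdc.properties, and restart Data Collector:
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize=true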

Update Oracle CDC Client Origin User Accounts

Starting with version 5.0.0, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.

After you upgrade to version 5.0.0 or later, use the following GRANT statements to update the Oracle user account associated with the origin:
GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>; 
GRANT select on V_$INSTANCE to <user name>;

Review Couchbase Pipelines

Starting with version 4.4.0, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.

However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:

https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar

To install an external library, see Install External Libraries.

Update Keystore Location

Starting with version 4.2.0, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, $SDC_RESOURCES. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration file.

In previous releases, you could store the keystore file in the Data Collector configuration directory, $SDC_CONF, and then define the location of the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.

Review Tableau CRM Pipelines

Starting with version 4.2.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, writes to Salesforce differently from versions 3.7.0 - 4.1.x. When upgrading from version 3.7.0 - 4.1.x, review Tableau CRM pipelines to ensure that the destination behaves appropriately. When upgrading from a version prior to 3.7.0, no action is needed.

With version 4.2.0 and later, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval when no new data arrives. You configure the interval with the Dataset Wait Time stage property.

In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.

After upgrading from version 3.7.0 - 4.1.x to version 4.2.0 or later, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to indicate the interval that Salesforce should wait before processing each dataset.

Resolve Kafka and MapR Streams Conflicts

Starting with version 4.0.0, Kafka stages and MapR Streams stages generate an error when you specify an additional Kafka or MapR configuration property that conflicts with a stage property setting.

In the stage properties, you can use the Override Stage Configurations property to enable user-defined Kafka or MapR configuration properties to take precedence. Or, you can remove or update the configuration property to allow the stage property to take precedence.

Review HTTP Client Processor Pipelines

Starting with version 4.0.0, the HTTP Client processor performs additional checks against the specified Batch Wait Time property. In certain cases, this change can generate errors. After upgrading from version 3.x to version 4.0.0 or later, verify that pipelines that include the HTTP Client processor perform as expected.

The Batch Wait Time property defines the maximum amount of time that the processor uses to process all HTTP requests for a single record. When the processing for a record exceeds the specified batch wait time, the output records are passed to the stage for error handling.

In previous releases, the HTTP Client processor only checked the batch wait time before each HTTP request. As a result, the processor did not always notice when the processing time exceeded the batch wait time.

Starting with version 4.0.0, the HTTP Client processor checks the batch wait time before and after every request. As a result, the processor may generate more errors than in previous releases.

Also, in previous releases, the default value for Batch Wait Time was 2,000 milliseconds. Starting with version 4.0.0, the default value is 100,000 milliseconds.

Important: When you upgrade from version 3.x to version 4.0.0 or later, the Batch Wait Time property in the HTTP Client processor is set to the new default of 100,000 milliseconds, unless you changed the property from the default.

For example, if you did not touch the Batch Wait Time property in a 3.x pipeline, then it is increased from 2,000 to 100,000 milliseconds during the upgrade. However, if you set the property to 3,000 milliseconds in a 3.x pipeline, then the processor retains the 3,000 millisecond batch wait time after the upgrade.

After upgrading from version 3.x to version 4.0.0 or later, verify that pipelines that include the HTTP Client processor perform as expected. If you want the processor to wait for all HTTP requests to complete, increase the Batch Wait Time as needed.

Verify Elasticsearch Security

Starting with version 3.21.0, Elasticsearch stages include additional security validation. As a result, pipelines with Elasticsearch security issues that previously ran without error might fail to start after you upgrade to version 3.21.0 or later.

When this occurs, check for additional details in the error messages, then correct the security issue or stage configuration, as needed.

For example, in earlier Data Collector versions, an Elasticsearch stage configured to use the AWS Signature V4 security mode with SSL/TLS would not generate an error if the certificate was missing from the specified truststore. With version 3.21.0 or later, the pipeline fails to start with the following error:
ELASTICSEARCH_43 - Could not connect to the server(s) <SSL/TLS error details>
As another example, in earlier versions, if you specify a port in an HTTP URL that doesn’t support the HTTPS protocol when configuring an Elasticsearch stage to use SSL/TLS, the stage used HTTP without raising an error. With version 3.21.0 or later, the pipeline fails to start with an error such as:
ELASTICSEARCH_43 - Could not connect to the server(s). 
Unrecognized SSL message, plaintext connection?

Note that the details of the message vary based on the originating server.

Adjust PostgreSQL CDC Pipelines or PostgreSQL Configuration

Starting with version 3.21.0, the PostgreSQL CDC Client origin includes a new Status Interval property that helps ensure that the wal2json logical decoder, which helps process changes, does not time out.

The new Status Interval origin property should be set to a value less than the wal_sender_timeout property in the PostgreSQL postgresql.conf file. Ideally, the Status Interval property should be half of the value configured for the wal_sender_timeout property.

By default, the Status Interval property is 30 seconds. The wal2json README.md file previously recommended setting the wal_sender_timeout property to 2000 milliseconds, or 2 seconds. If you use these values for both properties, the pipeline can trigger the following error:

com.streamsets.pipeline.api.StageException: JDBC_606 - Wal Sender is not active

To avoid this issue, update one of the properties so that Status Interval is half of wal_sender_timeout.

When possible, use the default Status Interval value and the default wal_sender_timeout value of 60000 milliseconds, or 60 seconds.
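
For example, with these defaults, the PostgreSQL setting and the origin property pair up as follows:
wal_sender_timeout = 60s        (in postgresql.conf)
Status Interval = 30 seconds    (in the PostgreSQL CDC Client origin)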

Review Processing of MySQL Data (JDBC Processors)

Starting with version 3.21.0, JDBC processors convert MySQL unsigned integer data types to different Data Collector types than in earlier Data Collector versions. This change occurred for JDBC origins in an earlier version.

When you upgrade to version 3.21.0 or later, review pipelines that use JDBC processors to work with MySQL database data to ensure that downstream expressions provide the expected results.

The following table describes the data type conversion changes:
MySQL Data Type        Conversion Before 3.21.0    Conversion with 3.21.0 and Later
Bigint Unsigned        Long                        Decimal
Int Unsigned           Integer                     Long
Mediumint Unsigned     Integer                     Long
Smallint Unsigned      Short                       Short

Review Google Pub/Sub Producer Pipelines

Starting with version 3.20.0, the Google Pub/Sub Producer destination requires specifying a positive integer value for the Max Outstanding Message Count and Max Outstanding Request Bytes properties.

In earlier Data Collector versions, you could set these properties to 0 to opt out of using them. With version 3.20.0 and later, these properties must be set to a positive integer.

Upgraded pipelines with these properties set to positive integers retain the configured values. Upgraded pipelines with these properties set to 0 are updated to use the new default values, as follows:
  • Max Outstanding Message Count is set to 1000 messages
  • Max Outstanding Request Bytes is set to 8000 bytes

If upgraded pipelines previously used 0 to opt out of using these properties, review the pipelines to ensure that the new default values are appropriate. Update the properties as needed.

Review JDBC Multitable Consumer Pipelines

Starting with version 3.20.0, the JDBC Multitable Consumer origin behavior while performing multithreaded processing with the Switch Tables batch strategy has changed. This affects multithreaded table and partition processing in a similar manner:
Multithreaded table processing
In earlier Data Collector versions, when you use the Switch Tables batch strategy with multithreaded table processing, multiple threads can take turns processing data within a single table, caching separate result sets for the table.
With this release, each table can have only a single result set cached at a time.
So while a thread switches tables between batches, it now skips tables that already have a result set from another thread. Only one thread can process the data in a table at a time.
Multithreaded partition processing
Similarly, in earlier Data Collector versions, when you use the Switch Tables batch strategy with multithreaded partition processing, multiple threads can take turns processing data within a single partition, caching separate result sets for the partition.
With this release, each partition can have only a single result set cached at a time.
So while a thread switches partitions between batches, it now skips partitions that already have a result set from another thread. Only one thread can process the data in a partition at a time.

Review upgraded pipelines that use the Switch Tables batch strategy. Depending on factors such as the number and size of the tables and partitions being processed, the change might negatively impact performance.

For example, say two threads process four tables in multithreaded table processing, and one table is much larger than the other tables. In earlier versions, using the Switch Tables batch strategy allowed multiple threads to help process the large table. With version 3.20.0 or later, only one thread can process data in one table at a time.

If pipeline performance has been negatively impacted, consider the following options:
  • If multithreaded table processing has slowed, you may have a mix of small and large tables.

    To enable large tables to be processed by more than one thread, consider using multithreaded partition processing for that table.

    To enable threads to cycle through the tables more quickly, you might reduce the number of batches generated from a result set using the Batches from Result Set property.

  • If multithreaded partition processing has slowed, you may have a mix of small and large partitions.

    To enable threads to cycle through the partitions more quickly, you might reduce the number of batches generated from a result set using the Batches from Result Set property.

For information about batch strategies, see Batch Strategy.

Update Oracle CDC Client Pipelines

Consider the following upgrade tasks for pipelines that contain the Oracle CDC Client origin, based on the version that you are upgrading from:

Upgrade from versions earlier than 3.19.0
Starting with version 3.19.0, Oracle CDC Client origins with the Parse SQL Query property enabled no longer generate records for SELECT_FOR_UPDATE operations.
If your Oracle CDC Client pipelines do not process SELECT_FOR_UPDATE operations or do not need to process SELECT_FOR_UPDATE operations, no changes are required.
If you want to capture SELECT_FOR_UPDATE statements, you can clear the Parse SQL Query property to write LogMiner SQL statements to generated records. Then, specify SELECT_FOR_UPDATE in the Operations property.
Upgrade from versions earlier than 3.7.0
Starting with version 3.7.0, pipelines that use the Oracle CDC Client origin can produce some duplicate data.
Due to a change in offset format, when the pipeline restarts, the Oracle CDC Client origin reprocesses all transactions with the commit SCN from the last offset to prevent skipping unread records. This issue occurs only for the last SCN that was processed before the upgrade, and only once, upon upgrading to Data Collector version 3.7.0 or later.
When possible, remove the duplicate records from the destination system.

Update Cluster EMR Batch Pipelines

Starting with version 3.19.0, cluster EMR batch pipelines that provision a cluster store the specified EMR version differently than in earlier versions. As a result, the EMR versions defined in earlier pipelines are not retained.

When you upgrade from a version earlier than 3.19.0, you must edit any cluster EMR batch pipeline that provisions a cluster, and define the EMR Version property.

Review Processing of MySQL Data (JDBC Origins)

Starting with version 3.17.0, JDBC origins convert MySQL unsigned integer data types to different Data Collector types than in earlier Data Collector versions.

When you upgrade to version 3.17.0 or later, review pipelines that use JDBC origins to process MySQL database data to ensure that downstream expressions provide the expected results.

The following table describes the data type conversion changes:
MySQL Data Type        Conversion Before 3.17.0    Conversion with 3.17.0 and Later
Bigint Unsigned        Long                        Decimal
Int Unsigned           Integer                     Long
Mediumint Unsigned     Integer                     Long
Smallint Unsigned      Short                       Short

Update Elasticsearch Security Properties (Optional)

Starting with version 3.17.0, Elasticsearch stages provide a User Name property and a Password property. Elasticsearch stages in previous versions pass the credentials together in a single Security Username/Password property.

When you upgrade to version 3.17.0 or later, any configuration in the Security Username/Password properties is moved to the new User Name property, where the Security Username/Password format, <username>:<password>, remains valid.

Though not required, you can update Elasticsearch stages to use the new User Name and Password properties.

Update Syslog Pipelines

Starting with version 3.9.0, the Syslog destination no longer includes the following properties on the Message tab:
  • Use Non-Text Message Format
  • Message Text

You now configure the destination to use the Text data format on the Data Format tab, just as you do with other destinations.

If pipelines created in a previous version include the Syslog destination configured to use text data, you must configure the Text data format properties on the Data Format tab after the upgrade.

JDBC Tee and JDBC Producer Cache Change

Starting with version 3.9.0, the JDBC Tee processor and the JDBC Producer destination no longer cache prepared statements when performing single-row operations. As a result, the Max Cache Size Per Batch property has been removed from both stages.

In previous versions when you enabled the stage to perform single-row operations, you could configure the Max Cache Size Per Batch property to specify the maximum number of prepared statements to store in the cache.

Pipeline Export

Starting with version 3.8.0, Data Collector has changed the behavior of the pipeline Export option. Data Collector now strips all plain text credentials from exported pipelines. Previously, Data Collector included plain text credentials in exported pipelines.

To use the previous behavior and include credentials in the export, choose the new Export with Plain Text Credentials option when exporting a pipeline.

Update TCP Server Pipelines

Starting with version 3.7.2, the TCP Server origin has changed the valid values for the Read Timeout property. The property now allows a minimum of 1 second and a maximum of 3,600 seconds.

In previous versions, the Read Timeout property had no maximum value and could be set to 0 to keep the connection open regardless of whether the origin read any data.

If pipelines created in a previous version have the Read Timeout property set to a value less than 1 or greater than 3,600, the upgrade process sets the property to the maximum value of 3,600 seconds. If necessary, update the Read Timeout property as needed after the upgrade.

Update Cluster Pipelines

Starting with version 3.7.0, Data Collector now requires that the Java temporary directory on the gateway node in the cluster is writable.

The Java temporary directory is specified by the Java system property java.io.tmpdir. On UNIX, the default value of this property is typically /tmp and is writable.

Previous Data Collector versions did not have this requirement. Before running upgraded cluster pipelines, verify that the Java temporary directory on the gateway node is writable.

Update Kafka Consumer or Kafka Multitopic Consumer Pipelines

Starting with version 3.7.0, Data Collector no longer uses the auto.offset.reset value set in the Kafka Configuration property to determine the initial offset for the Kafka Consumer or Kafka Multitopic Consumer origin. Instead, Data Collector uses the new Auto Offset Reset property to determine the initial offset. With the default setting of the new property, the origin reads all existing messages in a topic. In previous versions, the origin read only new messages by default. Because running a pipeline sets an offset value, configuration of the initial offset only affects pipelines that have never run.

After upgrading from a version earlier than 3.7.0, update any pipelines that have not run and use the Kafka Consumer or Kafka Multitopic Consumer origins.
  1. On the Kafka tab for the origin, set the value of the Auto Offset Reset property:
    • Earliest - Select to have the origin read messages starting with the first message in the topic (same behavior as configuring auto.offset.reset to earliest in previous versions of Data Collector).
    • Latest - Select to have the origin read messages starting with the last message in the topic (same behavior as not configuring auto.offset.reset in previous versions of Data Collector).
    • Timestamp - Select to have the origin read messages starting with messages at a particular timestamp, which you specify in the Auto Offset Reset Timestamp property.
  2. If configured in the Kafka Configuration property, delete the auto.offset.reset property.
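
For example, if the Kafka Configuration property previously included the following entry, remove the entry and set the Auto Offset Reset property to Earliest to preserve the original behavior:
auto.offset.reset=earliest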

Update JDBC Pipelines

Starting with version 3.5.0, Data Collector requires the maximum lifetime for a connection to be at least 30 minutes in stages that use a JDBC connection. Data Collector does not validate stages with lower non-zero values configured.

If you upgrade pipelines that include a stage that uses a JDBC connection, update the stage to set the maximum lifetime for a connection to be at least 30 minutes.

On the Advanced tab, set the Max Connection Lifetime property to be at least 30 minutes or 1800 seconds.

Update Spark Executor with Databricks Pipelines

Starting with version 3.5.0, Data Collector introduces a new Databricks Job Launcher executor and has removed the ability to use the Spark executor with Databricks.

If you upgrade pipelines that include the Spark executor with Databricks, you must update the pipeline to use the Databricks Job Launcher executor after you upgrade.

Update Pipelines to Use Spark 2.1 or Later

Starting with version 3.3.0, Data Collector removes support for Spark 1.x and introduces cluster streaming mode with support for Kafka security features such as SSL/TLS and Kerberos authentication using Spark 2.1 or later and Kafka 0.10.0.0 or later. For more information about these changes, see Upgrade to Spark 2.1 or Later.

After upgrading the Cloudera CDH distribution, Hortonworks Hadoop distribution, or Kafka system to the required version and then upgrading Data Collector, you must update pipelines to use Spark 2.1 or later. Pipelines that use the earlier systems will not run until you perform these tasks:

  1. Install the stage library for the upgraded system.
  2. In the pipeline, edit the stage and select the appropriate stage library.
  3. If the pipeline includes a Spark Evaluator processor and the Spark application was previously built with Spark 2.0 or earlier, rebuild it with Spark 2.1.

    Or if you used Scala to write the custom Spark class, and the application was compiled with Scala 2.10, recompile it with Scala 2.11.

  4. If the pipeline includes a Spark executor and the Spark application was previously built with Spark 2.0 or earlier, rebuild it with Spark 2.1 and Scala 2.11.

Update Value Replacer Pipelines

Starting with version 3.1.0.0, Data Collector introduces a new Field Replacer processor and has deprecated the Value Replacer processor.

The Field Replacer processor lets you define more complex conditions to replace values. For example, unlike the Value Replacer, the Field Replacer can replace values that fall within a specified range.

You can continue to use the deprecated Value Replacer processor in pipelines. However, the processor will be removed in a future release - so we recommend that you update pipelines to use the Field Replacer as soon as possible.

To update your pipelines, replace the Value Replacer processor with the Field Replacer processor. The Field Replacer replaces values in fields with nulls or with new values. In the Field Replacer, use field path expressions to replace values based on a condition.

For example, say that your Value Replacer processor is configured to replace null values in the product_id field with "NA" and to replace the "0289" store ID with "0132". In the Field Replacer processor, you can configure the same replacements using field path expressions.

Update Tableau CRM Pipelines

Starting with version 3.1.0.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, introduces a new append operation that lets you combine data into a single dataset. Configuring the destination to use dataflows to combine data into a single dataset has been deprecated.

You can continue to configure the destination to use dataflows. However, dataflows will be removed in a future release - so we recommend that you update pipelines to use the append operation as soon as possible.

Disable Cloudera Navigator Integration

Starting with version 3.0.0.0, the beta version of Cloudera Navigator integration is no longer available with Data Collector. Cloudera Navigator integration now requires a paid subscription. For more information about purchasing Cloudera Navigator integration, contact StreamSets.

When upgrading from a Data Collector version with Cloudera Navigator integration enabled to version 3.0.0.0 without a paid subscription, perform the following post-upgrade task:

Do not include the Cloudera Navigator properties when you configure the 3.0.0.0 Data Collector configuration file, sdc.properties. The properties to omit are:
  • lineage.publishers
  • lineage.publisher.navigator.def
  • All other properties with the lineage.publisher.navigator prefix

JDBC Multitable Consumer Query Interval Change

Starting with version 3.0.0.0, the Query Interval property is replaced by the new Queries per Second property.

Upgraded pipelines with the Query Interval specified using a constant or the default format and unit of time, ${10 * SECONDS}, have the new Queries per Second property calculated and defined as follows:
Queries per Second = Number of Threads / Query Interval (in seconds)
For example, say the origin uses three threads and Query Interval is configured for ${15 * SECONDS}. Then, the upgraded origin sets Queries per Second to 3 divided by 15, which is 0.2. This means the origin will run a maximum of two queries every 10 seconds.

The upgrade would occur the same way if Query Interval were set to 15.

Pipelines with a Query Interval configured to use other units of time, such as ${.1 * MINUTES}, or configured with a different expression format, such as ${SECONDS * 5}, are upgraded to use the default for Queries per Second, which is 10. This means the pipeline will run a maximum of 10 queries per second. The fact that these expressions are not upgraded correctly is noted in the Data Collector log.

If necessary, update the Queries per Second property as needed after the upgrade.

Update JDBC Query Consumer Pipelines used for SQL Server CDC Data

Starting with version 3.0.0.0, the Microsoft SQL Server CDC functionality in the JDBC Query Consumer origin has been deprecated and will be removed in a future release.

For pipelines that use the JDBC Query Consumer origin to process Microsoft SQL Server CDC data, replace the origin with one designed for SQL Server CDC data, such as the SQL Server CDC Client origin or the SQL Server Change Tracking origin.

Update MongoDB Destination Upsert Pipelines

Starting with version 3.0.0.0, the MongoDB destination supports the replace and update operation codes, and no longer supports the upsert operation code. You can use a new Upsert flag in conjunction with Replace and Update.

After upgrading from a version earlier than 3.0.0.0, update the pipeline as needed to ensure that records passed to the destination do not use the upsert operation code (sdc.operation.type = 4). Records that use the upsert operation code will be sent to error.

In previous releases, records flagged for upsert were treated in the MongoDB system as Replace operations with the Upsert flag set.

If you want to replicate the upsert behavior from earlier releases, perform the following steps:
  1. Configure the pipeline to use the Replace operation code.

    Make sure that the sdc.operation.type record header attribute is set to 7 for Replace instead of 4 for Upsert, as shown in the example after these steps.

  2. In the MongoDB destination, enable the new Upsert property.
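
For example, one way to set the operation code in step 1 is with an Expression Evaluator processor upstream of the destination that sets the record header attribute (a sketch; your pipelines might already set the operation type elsewhere, such as in a CDC origin):
Header Attribute: sdc.operation.type
Header Attribute Expression: 7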

Time Zones in Stages

Starting with version 3.0.0.0, time zones have been organized and updated to use JDK 8 names. This should make it easier to select time zones in stage properties.

In the rare case that an upgraded pipeline uses a format not supported by JDK 8, edit the pipeline to select a compatible time zone.

Update Kudu Pipelines

Consider the following upgrade tasks for Kudu pipelines, based on the version that you are upgrading from:

Upgrade from versions earlier than 3.0.0.0
Starting with version 3.0.0.0, if the destination receives a change data capture log from the following source systems, you must specify the source system so that the destination can determine the format of the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log, or MongoDB Oplog.
Previously, the Kudu destination could not directly receive changed data from these source systems. You either had to include a scripting processor in the pipeline to modify the field paths in the record to a format that the destination could read, or you had to add multiple Kudu destinations to the pipeline - one per operation type - and include a Stream Selector processor to route records to the appropriate destination by operation type.
If you implemented one of these workarounds, then after upgrading, modify the pipeline to remove the scripting processor or the Stream Selector processor and the multiple destinations. In the Kudu destination, set the Change Log Format to the appropriate format of the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log, or MongoDB Oplog.
Upgrade from versions earlier than 2.2.0.0
Starting with version 2.2.0.0, Data Collector provides support for Apache Kudu version 1.0.x and no longer supports earlier Kudu versions. To upgrade pipelines that contain a Kudu destination from Data Collector versions earlier than 2.2.0.0, upgrade your Kudu cluster and then add a stage alias for the earlier Kudu version to the Data Collector configuration file, $SDC_CONF/sdc.properties.

The configuration file includes stage aliases to enable backward compatibility for pipelines created with earlier versions of Data Collector.

To update Kudu pipelines:

  1. Upgrade your Kudu cluster to version 1.0.x.

    For instructions, see the Apache Kudu documentation.

  2. Open the $SDC_CONF/sdc.properties file and locate the following comment:
    # Stage aliases for mapping to keep backward compatibility on pipelines when stages move libraries
  3. Below the comment, add a stage alias for the earlier Kudu version as follows:
    stage.alias.streamsets-datacollector-apache-kudu-<version>-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget = streamsets-datacollector-apache-kudu_1_0-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget
    Where <version> is the earlier Kudu version: 0_7, 0_8, or 0_9. For example, if you previously used Kudu version 0.9, add the following stage alias:
    stage.alias.streamsets-datacollector-apache-kudu-0_9-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget = streamsets-datacollector-apache-kudu_1_0-lib, com_streamsets_pipeline_stage_destination_kudu_KuduDTarget
  4. Restart Data Collector to enable the changes.

Update JDBC Multitable Consumer Pipelines

Starting with version 2.7.1.1, the JDBC Multitable Consumer origin can now read from views in addition to tables. The origin now reads from all tables and all views that are included in the defined table configurations.

When upgrading pipelines that contain a JDBC Multitable Consumer origin from Data Collector versions earlier than 2.7.1.1, review the table configurations to determine if any views are included. If a table configuration includes views that you do not want to read, simply exclude them from the configuration.

Update Vault Pipelines

Starting with version 2.7.0.0, Data Collector introduces a credential store API and credential expression language functions to access Hashicorp Vault secrets.

In addition, the Data Collector Vault integration now relies on Vault's App Role authentication backend.

Previously, Data Collector used Vault functions to access Vault secrets and relied on Vault's App ID authentication backend. StreamSets has deprecated the Vault functions, and Hashicorp has deprecated the App ID authentication backend.

After upgrading, update pipelines that use Vault functions in one of the following ways:

Use the new credential store expression language functions (recommended)
To use the new credential functions, install the Vault credential store stage library and define the configuration properties used to connect to Vault. Then, update each upgraded pipeline that includes stages using Vault functions to use the new credential functions to retrieve the credential values.
For details on using the Vault credential store system, see Hashicorp Vault.
Continue to use the deprecated Vault functions
You can continue to use the deprecated Vault functions in pipelines. However, the functions will be removed in a future release - so we recommend that you use the credential functions as soon as possible.
To continue to use the Vault functions, make the following changes after upgrading:
  • Uncomment the single Vault EL property in the $SDC_CONF/vault.properties file.
  • The remaining Vault configuration properties have been moved to the $SDC_CONF/credential-stores.properties file. The properties use the same name, with an added "credentialStore.vault.config" prefix. Copy any values that you customized in the previous vault.properties file into the same property names in the credential-stores.properties file.
  • Define the Vault Role ID and Secret ID that Data Collector uses to authenticate with Vault in the credential-stores.properties file. Defining an App ID for Data Collector is deprecated and will be removed in a future release.
For details on using the Vault functions, see Accessing Hashicorp Vault Secrets with Vault Functions (deprecated).

Configure JDBC Producer Schema Names

Starting with Data Collector version 2.5.0.0, you can use a Schema Name property to specify the database or schema name. In previous releases, you specified the database or schema name in the Table Name property.

Upgrading from a previous release does not require changing any existing configuration at this time. But we recommend using the new Schema Name property, since the ability to specify a database or schema name with the table name might be deprecated in the future.

Evaluate Precondition Error Handling

Starting with Data Collector version 2.5.0.0, precondition error handling has changed.

The Precondition stage property allows you to define conditions that must be met for a record to enter the stage. Previously, records that did not meet all specified preconditions were passed to the pipeline for error handling. That is, the records were processed based on the Error Records pipeline property.

With version 2.5.0.0, records that do not meet the specified preconditions are handled by the error handling configured for the stage. Stage error handling occurs based on the On Record Error property on the General tab of the stage.

Review pipelines that use preconditions to verify that this change does not adversely affect the behavior of the pipelines.

Authentication for Docker Image

Starting with Data Collector version 2.4.1.0, the Docker image now uses the form type of file-based authentication by default. As a result, you must use a Data Collector user account to log in to the Data Collector. If you haven't set up custom user accounts, you can use the admin account shipped with the Data Collector. The default login is: admin / admin.

Earlier versions of the Docker image used no authentication.

Configure Pipeline Permissions

Data Collector version 2.4.0.0 is designed for multitenancy and enables you to share and grant permissions on pipelines. Permissions determine the access level that users and groups have on pipelines.

In earlier versions of Data Collector without pipeline permissions, pipeline access is determined by roles. For example, any user with the Creator role could edit any pipeline.

In version 2.4.0.0, roles are augmented with pipeline permissions. In addition to having the necessary role, users must also have the appropriate permissions to perform pipeline tasks.

For example, to edit a pipeline in 2.4.0.0, a user with the Creator role must also have read and write permission on the pipeline. Without write permission, the user cannot edit the pipeline. Without read permission, the user cannot see the pipeline at all. It does not display in the list of available pipelines.

Note: With pipeline permissions enabled, all upgraded pipelines are initially visible only to users with the Admin role and the pipeline owner - the user who created the pipeline. To enable other users to work with pipelines, have an Admin user configure the appropriate permissions for each pipeline.

In Data Collector version 2.5.0.0, pipeline permissions are disabled by default. To enable pipeline permissions, set the pipeline.access.control.enabled property to true in the Data Collector configuration file.

Tip: You can configure pipeline permissions when permissions are disabled. Then, you can enable the pipeline permissions property after pipeline permissions are properly configured.

For more information about roles and permissions, see Roles and Permissions. For details about configuring pipeline permissions, see Sharing Pipelines.

Update Elasticsearch Pipelines

Data Collector version 2.3.0.0 includes an enhanced Elasticsearch destination that uses the Elasticsearch HTTP API. To upgrade pipelines that use the Elasticsearch destination from Data Collector versions earlier than 2.3.0.0, you must review the value of the Default Operation property.

Review all upgraded Elasticsearch destinations to ensure that the Default Operation property is set to the correct operation. Upgraded Elasticsearch destinations have the Default Operation property set based on the configuration for the Enable Upsert property:

  • With upsert enabled, the default operation is set to INDEX.
  • With upsert not enabled, the default operation is set to CREATE, which requires a DocumentId.
Note: The Elasticsearch version 5 stage library is compatible with all versions of Elasticsearch. Earlier stage library versions have been removed.