Post Upgrade Tasks

After you upgrade Data Collector, complete the following tasks, as needed.

Review the Maximum Batch Vault Size for Oracle CDC Origin Pipelines

Starting with version 5.9.0, the Oracle CDC origin has a Max Batch Vault Size property that allows you to configure the maximum number of batches the origin pre-generates while the pipeline is processing other batches. In upgraded pipelines, the origin uses the default maximum batch vault size of 64.

After you upgrade to version 5.9.0, review Oracle CDC origin pipelines. If the maximum batch vault size is not appropriate, update the pipelines accordingly.

Review Amazon, Azure, Data Parser, JMS Consumer, and Pulsar Consumer Origin Pipelines

Starting with version 5.9.0, the following origins no longer read tables that contain multiple columns with the same name:
  • Amazon S3
  • Amazon SQS Consumer
  • Azure Blob Storage
  • Azure Data Lake Storage Gen2
  • Data Parser
  • JMS Consumer
  • Pulsar Consumer
  • Pulsar Consumer (Legacy)

When configured to read tables that contain duplicate column names, the origin treats the tables as invalid and generates an error.

After you upgrade to version 5.9.0, review pipelines that use these origins. If any pipelines require the ability to read tables containing multiple columns with the same name, configure the origins to ignore column headers.

Review Oracle Bulkload Origin Pipelines

Starting with version 5.8.0, pipelines using the Oracle Bulkload origin no longer fail when the origin encounters an empty table. This change might cause Oracle Bulkload pipelines created with earlier versions of Data Collector to behave in unexpected ways.

After you upgrade to version 5.8.0, review any pipelines that use the Oracle Bulkload origin to ensure they behave as expected.

Update Stages That Were Using Enterprise Stage Libraries

Starting with version 5.8.0, Data Collector no longer supports Enterprise stage libraries.

After you upgrade to version 5.8.0, update any stages that use the following Enterprise stage libraries by installing each library as a custom stage library:
  • Protector
  • Microsoft SQL Server 2019 Big Data Cluster

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.7.0, the Oracle CDC origin must use a user account with access to the all_tab_cols view.

After you upgrade to version 5.7.0 or later, run the following command in Oracle to grant the user account access to the view:
grant select on all_tab_cols to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.
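
The order of operations described above can be sketched as a small helper that lists the statements to run. This is an illustrative sketch, not part of Data Collector: the function name, the user name c##sdc_user, and the pluggable database name orclpdb1 are hypothetical, and switching containers with ALTER SESSION SET CONTAINER is an assumption about how you reach each container. You can equally connect to each container separately.

```python
from typing import List, Optional


def grant_view_access(user: str, view: str, pdb: Optional[str] = None) -> List[str]:
    """Return the SQL statements to run, in order, for one user and one view."""
    grant = f"grant select on {view} to {user};"
    if pdb is None:
        # Non-CDB database: run the grant once from the primary database.
        return [grant]
    # CDB database: run the grant from the root container, then from the PDB.
    return [
        "alter session set container = cdb$root;",
        grant,
        f"alter session set container = {pdb};",
        grant,
    ]


# Hypothetical user and pluggable database names, for illustration only.
for statement in grant_view_access("c##sdc_user", "all_tab_cols", pdb="orclpdb1"):
    print(statement)
```

Review the generated statements before running them with an account that has sufficient privileges in each container.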

Review Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 Origin Pipelines

Starting with version 5.7.0, the Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins have a File Processing Delay property that sets the minimum number of milliseconds that must pass after a file is created before the origin processes it. In upgraded pipelines, these origins receive the default file processing delay of 10,000 milliseconds.

After you upgrade to version 5.7.0 or later, review pipelines that include the Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins. If the 10,000 millisecond delay is not appropriate, update the pipelines accordingly.
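
To help judge whether the default delay suits a pipeline, the rule described above can be sketched as follows. This is a simplified model of the behavior as stated here, not the origin's actual implementation. The function and parameter names are hypothetical.

```python
DEFAULT_DELAY_MS = 10_000  # default File Processing Delay in upgraded pipelines


def is_ready(created_ms: int, now_ms: int, delay_ms: int = DEFAULT_DELAY_MS) -> bool:
    """Return True once at least delay_ms milliseconds have passed since creation."""
    return now_ms - created_ms >= delay_ms


# A file created 4 seconds ago is held back under the default delay,
# but would be eligible with a 2,000 millisecond delay.
print(is_ready(created_ms=0, now_ms=4_000))
print(is_ready(created_ms=0, now_ms=4_000, delay_ms=2_000))
```

In this model, the delay sets a lower bound on the time between a file being created and being processed.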

Review Amazon S3 and Databricks Delta Lake Stages

Starting with version 5.6.0, you can no longer include the forward slash (/) in the following properties due to an Amazon Web Services (AWS) SDK upgrade:
  • Bucket property for the Amazon S3 origin
  • Bucket and path property for the Amazon S3 destination and executor
  • Bucket property for the Databricks Delta Lake destination when staging files to Amazon S3

For more information about this change, see the aws-sdk-java list of Amazon S3 bug fixes.

As a result, you can define only the bucket name in these bucket properties. Use the following properties for each stage to define the path to an object inside the bucket:
  • Amazon S3 origin - Common Prefix and Prefix Pattern properties
  • Amazon S3 destination - Common Prefix and Partition Prefix properties
  • Amazon S3 executor - Object property on the Tasks tab
  • Databricks Delta Lake destination - Stage File Prefix property on the Staging tab

After you upgrade to version 5.6.0 or later, review the bucket property in these stages to ensure that the property defines the bucket name only. Modify the properties as needed to define only the bucket name in the bucket property and to define the path in the remaining properties.

For example, if an Amazon S3 origin configured in an earlier Data Collector version defines the properties as follows:
  • Bucket: orders/US/West
  • Common Prefix:
  • Prefix Pattern: **/*.log
Update the properties as follows:
  • Bucket: orders
  • Common Prefix: US/West/
  • Prefix Pattern: **/*.log
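
When many stages need the same change, the split shown in this example can be sketched as a helper. This is an illustrative sketch only: the function name is hypothetical, and placing the path taken from the bucket value in front of any existing common prefix is an assumption about how the earlier configuration resolved.

```python
from typing import Tuple


def split_legacy_bucket(bucket: str, common_prefix: str = "") -> Tuple[str, str]:
    """Split a legacy 'bucket/path' value into (bucket name, common prefix)."""
    name, _, path = bucket.strip("/").partition("/")
    # Assumption: the path embedded in the bucket value comes before any
    # common prefix that was already configured.
    parts = [p for p in (path, common_prefix.strip("/")) if p]
    prefix = "/".join(parts)
    # Keep a trailing slash on a non-empty prefix, as in the example above.
    return name, (prefix + "/" if prefix else "")


bucket, prefix = split_legacy_bucket("orders/US/West")
print(bucket, prefix)
```

The Prefix Pattern property is unaffected by the split, so it stays as configured.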

Install the Databricks Stage Library

Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the Databricks stage library, streamsets-datacollector-sdc-databricks-lib, to enable pipelines and jobs that use these Databricks stages or the Databricks connection to run as expected.

Review Databricks Stages

Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.

After you upgrade to version 5.6.0 or later, review the JDBC URL property in the Databricks Delta Lake destination and the JDBC Connection String property in the Databricks Query executor to ensure that the scheme resolves to jdbc:databricks.
Note: The upgrade process does not update runtime parameters. You must manually change runtime parameters that define the URL or connection string.
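
For values that must be changed by hand, such as runtime parameters, the scheme change can be sketched as a simple string rewrite. The function name and the sample URL are hypothetical. The sketch changes only the scheme and leaves the rest of the URL untouched.

```python
def to_databricks_scheme(url: str) -> str:
    """Replace a leading jdbc:spark scheme with jdbc:databricks."""
    old, new = "jdbc:spark:", "jdbc:databricks:"
    if url.startswith(old):
        return new + url[len(old):]
    # URLs that already use jdbc:databricks, or another scheme, are unchanged.
    return url


# Hypothetical URL, for illustration only.
print(to_databricks_scheme("jdbc:spark://example.cloud.databricks.com:443/default"))
```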

Update the Databricks Delta Lake Connection

Starting with version 5.6.0, the scheme of the JDBC URL in the Databricks Delta Lake connection is jdbc:databricks rather than jdbc:spark.

After you update a connection to use a version 5.6.0 or later authoring Data Collector, edit the JDBC URL property to use the jdbc:databricks scheme.

Review Scripts in Jython Stages

Starting with version 5.6.0, Jython stages use Jython 2.7.3 to process data.

After you upgrade to version 5.6.0 or later, review the scripts used in the Jython Scripting origin and the Jython Evaluator processor to ensure that they process data as expected.

Install the JDBC Oracle Stage Library

Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the JDBC Oracle stage library, streamsets-datacollector-jdbc-oracle-lib, to enable pipelines and jobs that use the Oracle Bulkload origin to run as expected.

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.6.0, the Oracle CDC origin requires that the configured database user has access to the v$containers view.

After you upgrade to version 5.6.0 or later, run the following command in Oracle to grant the user account access to the view:
grant select on v$containers to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.

Update Origins and Processors that Read Compressed Files

Starting with version 5.6.0, origins that read compressed files require you to set the Compression Library property to properly read files compressed with the Airlift version of Snappy. Destinations compress files with the Airlift version of Snappy. This affects the HTTP Client processor and the following origins:
  • Amazon S3
  • Azure Blob Storage
  • Azure Data Lake Storage Gen1
  • Azure Data Lake Storage Gen2 (Legacy)
  • Azure IoT/Event Hub Consumer
  • CoAP Server
  • Directory
  • File Tail
  • Hadoop FS Standalone
  • Google Cloud Storage
  • Google Pub/Sub Subscriber
  • gRPC Client
  • HTTP Client
  • HTTP Server
  • Kafka Multitopic Consumer
  • MQTT Subscriber
  • REST Service
  • SFTP/FTP/FTPS Client
  • TCP Server
  • WebSocket Client
  • WebSocket Server

After you upgrade to version 5.6.0 or later, review your pipelines. In any origins and processors that read files compressed using the Airlift version of Snappy, including files produced by destinations, set the Compression Library property to Snappy (Airlift Snappy).

Install the Azure Stage Library

Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.

After you upgrade to version 5.5.0 or later, install the Azure stage library, streamsets-datacollector-azure-lib, so that pipelines and jobs that use the Azure Synapse SQL destination or connection run as expected.

Review Salesforce Pipelines

Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.

After you upgrade to version 5.5.0 or later, review pipelines with Salesforce stages and ensure that they do not expect dates to be imported as strings.

Review OPC UA Client Pipelines

Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.

After you upgrade to version 5.5.0 or later, review OPC UA Client pipelines to ensure that the configuration for the Max Message Size property is appropriate for the pipeline. The default maximum message size is 2097152.

Install the Snowflake Stage Library to Use Snowflake

Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.

After you upgrade to 5.4.0 or later, install the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib, to enable pipelines and jobs that use Snowflake stages or connections to run as expected.

Install the Google Cloud Stage Library to Use BigQuery

Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.

After you upgrade to version 5.3.0 or later, install the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, to enable pipelines and jobs using BigQuery stages or connections to run as expected.

Review JDBC Multitable Consumer Pipelines

Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.

Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property.

After you upgrade to version 5.3.0 or later, review JDBC Multitable Consumer origin pipelines to ensure that the new value for the Minimum Idle Connections property is appropriate for each pipeline.

Review Missing Field Behavior for Field Replacer Processors

Starting with version 5.3.0, the advanced Field Does Not Exist property in the Field Replacer processor has the following two new options that replace the Include without Processing option:
  • Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
  • Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.

After you upgrade to version 5.3.0 or later, the Field Does Not Exist property is set to Add New Field. Review Field Replacer pipelines to ensure that this behavior is appropriate.

Review runtime:loadResource Pipelines

Starting with version 5.3.0, pipelines that include the runtime:loadResource function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.

After you upgrade to version 5.3.0 or later, review pipelines that use the runtime:loadResource function and ensure that the function calls resource files that include the required information.

Manage Underscores in Snowflake Connection Information

Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens, by default. This can adversely affect communicating with Snowflake when Snowflake connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.

After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.

When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
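
A quick way to find affected connection information is to check the host portion of each Snowflake URL for underscores. The sketch below assumes that the host is the part that matters, which is inferred from the name of the allowUnderscoresInHost property. The function name and sample URL are hypothetical.

```python
from urllib.parse import urlsplit


def host_has_underscore(jdbc_url: str) -> bool:
    """Return True if the host portion of a JDBC URL contains an underscore."""
    # Drop the leading "jdbc:" so that urlsplit sees "snowflake://host/...".
    prefix = "jdbc:"
    url = jdbc_url[len(prefix):] if jdbc_url.startswith(prefix) else jdbc_url
    host = urlsplit(url).hostname or ""
    # Underscores elsewhere in the URL, such as in query parameters, are ignored.
    return "_" in host


# Hypothetical account URL, for illustration only.
print(host_has_underscore("jdbc:snowflake://my_org-my_account.snowflakecomputing.com/?warehouse=wh_1"))
```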

Review MySQL Binary Log Pipelines

Starting with version 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.

In previous releases, when reading from a database where the binlog_row_metadata MySQL database property was set to MINIMAL, Enum fields were converted to Long and Set fields were converted to Integer.

In version 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.

After you upgrade to version 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.

Review Blob and Clob Processing in Oracle CDC Client Pipelines

Starting with version 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.

In previous releases, the origin did not process Blob or Clob columns. However, when the Unsupported Fields to Records property was enabled, the origin included Blob and Clob field names and raw string values.

Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.

Review Oracle CDC Client pipelines to assess how they should handle Blob and Clob columns:
  • To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size.

    Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.

  • If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases.

    If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.

    In a future release, this behavior will change so the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.

Review Error Handling for Snowflake CDC Pipelines

In previous releases of the Snowflake Enterprise stage library, when the Snowflake destination runs a MERGE query that fails to write all CDC data in a batch to Snowflake, the Snowflake destination generates a stage error indicating that there was a difference between the number of records expected to be written and the number of records actually written to Snowflake.

The destination does not provide additional detail because Snowflake does not provide information about the individual records that failed to be written when a query fails.

Starting with version 1.12.0 of the Snowflake Enterprise stage library, when a query that writes CDC data fails, in addition to generating the stage error, the Snowflake destination passes all records in the batch to error handling. As a result, the error records are handled based on the error handling configured for the stage and pipeline.

Review stage and pipeline error handling for Snowflake CDC pipelines to ensure that error records are handled appropriately.

Note: The error records passed to error handling have been processed by the Snowflake destination. For example, if the batch includes three records that update the same row, they are merged into a single update record.

Review SQL Server Pipelines with Unencrypted Connections

Starting with version 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this driver version introduced a backward-incompatible change.

As a result, after you upgrade to 5.1.0 or later, upgraded pipelines that connect to Microsoft SQL Server without SSL/TLS encryption will likely fail with a message such as the following:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.

This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.

Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
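
The connection string change can be sketched as follows, assuming the usual semicolon-separated property format of SQL Server JDBC connection strings. The function name and the sample connection string are hypothetical. The sketch does not handle property values that themselves contain semicolons.

```python
def with_encrypt_false(connection_string: str) -> str:
    """Return the connection string with encrypt=false set exactly once."""
    head, *props = connection_string.rstrip(";").split(";")
    # Drop any existing encrypt property so that it is not defined twice.
    kept = [p for p in props if p.split("=", 1)[0].strip().lower() != "encrypt"]
    return ";".join([head, *kept, "encrypt=false"])


# Hypothetical connection string, for illustration only.
print(with_encrypt_false("jdbc:sqlserver://db.example.com:1433;databaseName=sales"))
```

Disabling encryption this way is a workaround. Configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector remains the other resolution described above.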

To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.

Review Dockerfiles for Custom Docker Images

Starting with version 5.1.0, the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. In previous releases, the Data Collector Docker image used Alpine Linux as a parent image.

If you build custom Data Collector images using streamsets/datacollector version 5.0.0 or earlier as the parent image, review your Dockerfiles and update them as needed for compatibility with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0 or later.

Review Oracle CDC Client Local Buffer Pipelines

Starting with version 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.

After you upgrade to Data Collector 5.1.0 or later, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.

This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short term troubleshooting.

Update Oracle CDC Client Origin User Accounts

Starting with version 5.0.0, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.

After you upgrade to version 5.0.0 or later, use the following GRANT statements to update the Oracle user account associated with the origin:
GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>;
GRANT select on V_$INSTANCE to <user name>;

Review Couchbase Pipelines

Starting with version 4.4.0, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.

However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:

https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar

To install an external library, see Install External Libraries.

Update Keystore Location

Starting with version 4.2.0, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties.

In previous releases, you stored the keystore file in the Data Collector configuration directory, <installation_dir>/etc, and then defined the location of the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.

Review Tableau CRM Pipelines

Starting with version 4.2.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, writes to Salesforce differently from versions 3.7.0 - 4.1.x. When upgrading from version 3.7.0 - 4.1.x, review Tableau CRM pipelines to ensure that the destination behaves appropriately. When upgrading from a version prior to 3.7.0, no action is needed.

With version 4.2.0 and later, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval when no new data arrives. You configure the interval with the Dataset Wait Time stage property.

In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.

After upgrading from version 3.7.0 - 4.1.x to version 4.2.0 or later, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to indicate the interval that Salesforce should wait before processing each dataset.