Post Upgrade Tasks

Review Microsoft Azure Key Vault Credential Store Configuration

In version 6.0.x, Data Collector uses client key authentication for an Azure Key Vault credential store if the credential method property is not configured. Starting with version 6.1.0, Data Collector generates an error if the credential method property is not configured with either client key or managed identity authentication.

After upgrading to version 6.1.x or later from version 6.0.x, review the configuration file for upgraded Azure Key Vault credential stores to ensure the credential method property is properly configured.

Install the New SFTP/FTP/FTPS Stage Library

Starting with Data Collector 6.1.0, SFTP/FTP/FTPS stages have been removed from the basic stage library and added to the new SFTP/FTP/FTPS stage library.

After upgrading to version 6.1.0 or later, if you have upgraded pipelines using an SFTP/FTP/FTPS origin, destination, or executor, you must install the streamsets-datacollector-file-transfer-lib stage library.

Install the New HTTP Stage Library

Starting with Data Collector 6.1.0, HTTP stages have been removed from the basic stage library and added to the new HTTP stage library.

After upgrading to version 6.1.0 or later, install the streamsets-datacollector-http-lib stage library if you have upgraded pipelines using any of the following stages:

HTTP Client origin, processor, and destination
HTTP Router processor
HTTP Server origin

Review EC2 Instance Configuration for AWS Secrets Manager Credential Stores

Starting with Data Collector 6.1.0, AWS Secrets Manager credential stores support Instance Metadata Service version 2. For version 2, Amazon recommends setting the hop limit to 2 in container environments to avoid delays.

If Data Collector runs in a container environment on an Amazon EC2 instance with Instance Metadata Service version 2 and uses instance profile authentication for an AWS Secrets Manager credential store, after upgrading to Data Collector 6.1.0 or later, set the Instance Metadata Service hop limit to 2. For more information, see the Amazon EC2 documentation.
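
As a rough sketch, the hop limit could be set with the AWS SDK for Python (boto3) through the EC2 ModifyInstanceMetadataOptions operation. This assumes boto3 is installed and that the credentials in use are allowed to modify the instance. The helper names and the instance ID are hypothetical placeholders; the AWS CLI or console work equally well.

```python
def hop_limit_request(instance_id, hop_limit=2):
    """Build the arguments for the EC2 ModifyInstanceMetadataOptions call."""
    # EC2 accepts hop limits from 1 to 64.
    if not 1 <= hop_limit <= 64:
        raise ValueError("hop limit must be between 1 and 64")
    return {
        "InstanceId": instance_id,
        "HttpPutResponseHopLimit": hop_limit,
    }


def apply_hop_limit(instance_id, hop_limit=2):
    """Send the request to EC2. Requires boto3 and valid AWS credentials."""
    # Imported here so that hop_limit_request works without boto3 installed.
    import boto3

    ec2 = boto3.client("ec2")
    return ec2.modify_instance_metadata_options(
        **hop_limit_request(instance_id, hop_limit)
    )


if __name__ == "__main__":
    # Placeholder instance ID; prints the request without calling AWS.
    print(hop_limit_request("i-0123456789abcdef0"))
```

Calling apply_hop_limit with the ID of the instance that hosts the Data Collector container sends the change to EC2.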

Review Region and Endpoint Configuration for Stages Using Amazon Web Services

Starting with Data Collector 6.1.0, you can configure a region and endpoint for the Assume Role property in stages using AWS. In previous versions, the assumed role uses the same regional endpoint configured for the stage. Upgraded stages configured to assume a role get the region and endpoint for the assumed role from the AWS service it connects to. If those are not available, the assumed role uses the global endpoint for assumed roles.

After upgrading to version 6.1.0, review upgraded stages using AWS to ensure all stages configured to assume a role are configured with the appropriate region and endpoint.

Review Dockerfiles for Custom Docker Images

In previous releases, the Data Collector Docker image used Ubuntu as a parent image. Starting with version 6.0.0, the Data Collector Docker image uses Red Hat Enterprise Linux as a parent image.

After upgrading to version 6.0.0 or later, if you build custom Data Collector images using earlier releases of streamsets/datacollector as the parent image, review your Dockerfiles and make any updates required for compatibility with Red Hat Enterprise Linux before you build a custom Data Collector image.

Note: For information on Red Hat Enterprise Linux version compatibility, see the Dockerfiles file in the Dockerfiles for Data Collector GitHub directory.

Removed Antenna Doctor

In previous versions of Data Collector, you could configure Antenna Doctor to suggest potential fixes and workarounds to common pipeline issues. Starting with version 6.0, Antenna Doctor is not included in Data Collector.

After upgrading to version 6.0 or later, Antenna Doctor does not send pipeline messages.

Review Azure Blob Storage Origins

In previous versions of Data Collector, the default spooling period for Azure Blob Storage origins was 5 seconds. Starting with version 6.0, the default spooling period is 30 seconds, and upgraded origins are given the new default value.

After upgrading to version 6.0 or later, review upgraded Azure Blob Storage origins to ensure they are configured with an appropriate spooling period.

Review HashiCorp Vault Credential Stores

In previous versions of Data Collector, the HashiCorp Vault credential store authMethod property could be empty. Starting with version 6.0, the authentication method must be set to one of the following values:

appId
appRole
azure

After upgrading to version 6.0 or later, verify that the authentication method for upgraded credential stores is set to a valid value.

Install the CONNX JDBC Driver for Upgraded CONNX and CONNX CDC Origins

Starting with version 5.12.0, the CONNX JDBC driver is no longer included in the streamsets-datacollector-connx-lib stage library. You must manually install the driver into the stage library before using a CONNX or CONNX CDC origin.

After upgrading to version 5.12.0 or later, if you have upgraded pipelines using a CONNX or CONNX CDC origin, install the CONNX JDBC driver as an external library for the streamsets-datacollector-connx-lib stage library.

Install the Oracle JDBC Driver for Upgraded Oracle Multitable Consumer Origins and Oracle Destinations

Starting with version 5.12.0, the Oracle JDBC driver is no longer included in the streamsets-datacollector-jdbc-branded-oracle-lib stage library. You must manually install the driver into the stage library before using an Oracle Multitable Consumer origin or Oracle destination.

After upgrading to version 5.12.0 or later, if you have upgraded pipelines using an Oracle Multitable Consumer origin or Oracle destination, install the Oracle JDBC driver as an external library for the streamsets-datacollector-jdbc-branded-oracle-lib stage library.

Review Pipelines with Google BigQuery or Snowflake Destinations Writing JSON Data

Starting with version 5.11.0, you cannot configure characters to represent null values or newline characters for Google BigQuery or Snowflake destinations when writing JSON data. Upgraded destinations do not change null values or newline characters.

After upgrading to version 5.11.0 or later, review pipelines using Google BigQuery or Snowflake destinations writing JSON data to ensure the destination does not receive any null values or newline characters from the pipeline that should not be passed to the external system.

Review Snowflake File Uploader Staging Details

Starting with version 5.11.0, the Database and Schema properties have been removed from the Snowflake File Uploader destination. In Data Collector 5.10.0, if the Stage Database and Stage Schema properties were not configured, the destination used the database and schema values configured for the destination or Control Hub connection instead.

When upgrading from version 5.10.0 to 5.11.0 or later, Snowflake File Uploader destinations that do not use a Control Hub connection and do not have values configured for the Stage Database or Stage Schema properties are assigned a staging database value equal to the configured database value and a staging schema value equal to the configured schema value. Snowflake File Uploader destinations that use a Control Hub connection and do not have values configured for the Stage Database or Stage Schema properties are not assigned any values for these properties and must have them configured after upgrading.

After upgrading to version 5.11.0 or later from version 5.10.0, review Snowflake File Uploader destinations that use Control Hub connections and make sure the Stage Database and Stage Schema properties are configured.

Review Pipeline Notification Email Configurations

Starting with version 5.11.0, the format of pipeline notification emails has changed to include a new error code format.

After upgrading to version 5.11.0 or later, review notification email configurations for upgraded pipelines to ensure they behave as expected.

Review the Batch Wait Time for Directory Origins

Starting with version 5.11.0, the origin correctly interprets the batch wait time value as seconds. In earlier releases, the origin incorrectly interpreted the value as milliseconds.

After upgrading to version 5.11.0 or later, review Directory origins to ensure they are configured with an appropriate batch wait time.

Review the Oracle CDC Client Record Cache Size

Starting with version 5.10.0, you can configure the maximum size of the record cache for an Oracle CDC Client origin using the Records Cache Size property. Upgraded pipelines are given the default value of -2, which represents two times the batch size.

After upgrading to Data Collector 5.10.0 or later, verify that Oracle CDC Client origins are configured with the appropriate record cache size.
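
To judge whether the default is appropriate, it can help to work out how many records it allows. The following minimal sketch covers only the documented default of -2. It makes no claim about how other values are interpreted, and the function name is hypothetical.

```python
DEFAULT_RECORDS_CACHE_SIZE = -2  # default given to upgraded pipelines


def default_cache_records(batch_size):
    """Record count implied by the default of -2: two times the batch size."""
    return 2 * batch_size


if __name__ == "__main__":
    # A batch size of 1000 records gives a cache of 2000 records.
    print(default_cache_records(1000))
```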

Review Search Mode Behavior for Start Jobs Pipelines

Starting with version 5.10.0, Start Jobs stages have updated search mode options. Pipelines upgraded from version 5.2.x or earlier that were configured with the contain search mode option are updated to use the new contains unique search mode option.

After upgrading to Data Collector 5.10.0 or later from Data Collector 5.2.x or earlier, verify that Start Jobs pipelines are using the appropriate search mode.

Review the Maximum Batch Vault Size for Oracle CDC Origin Pipelines

Starting with version 5.9.0, the Oracle CDC origin has a Max Batch Vault Size property that allows you to configure the maximum number of batches the origin pre-generates while the pipeline is processing other batches. In upgraded pipelines, the origin uses the default maximum batch vault size of 64.

After you upgrade to version 5.9.0 or later, review Oracle CDC origin pipelines. If the maximum batch vault size is not appropriate, update the pipelines accordingly.

Review Amazon, Azure, Data Parser, JMS Consumer, and Pulsar Consumer Origin Pipelines

Starting with version 5.9.0, the following origins no longer read tables that contain multiple columns with the same name:

Amazon S3
Amazon SQS Consumer
Azure Blob Storage
Azure Data Lake Storage Gen2
Data Parser
JMS Consumer
Pulsar Consumer
Pulsar Consumer (Legacy)

When configured to read tables that contain duplicate column names, the origin treats the tables as invalid and generates an error.

After you upgrade to version 5.9.0 or later, review pipelines that use these origins. If any pipelines require the ability to read tables containing multiple columns with the same name, configure the origins to ignore column headers.
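
To find affected data before running an upgraded pipeline, you might scan header lines for repeated names. The following is a minimal sketch for delimited data. It assumes a case-sensitive comparison after trimming whitespace, which may not match how the origins compare names, and the function name is hypothetical.

```python
import csv
import io
from collections import Counter


def duplicate_columns(header_line, delimiter=","):
    """Return the column names that appear more than once in a header line."""
    reader = csv.reader(io.StringIO(header_line), delimiter=delimiter)
    names = next(reader, [])  # an empty header yields no names
    counts = Counter(name.strip() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(duplicate_columns("id,name,id,price"))
```

An empty result means the header line has no duplicate column names.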

Review JDBC Lookup Processor SQL Query Configuration

Starting with version 5.9.0, stability and performance improvements to the JDBC Lookup processor cause the processor to strictly enforce the requirement of a WHERE clause in SQL queries.

After upgrading to version 5.9.0 or later, verify that the SQL Query property for each JDBC Lookup processor is configured with a WHERE clause.
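
When many pipelines need review, a simple text check can flag queries that lack a WHERE clause. The sketch below is a heuristic, not a SQL parser, and it does not reproduce the processor's own validation. It ignores the word inside single-quoted literals and line comments, and the function name is hypothetical.

```python
import re


def has_where_clause(sql):
    """Heuristically report whether a SQL query contains a WHERE keyword."""
    # Blank out single-quoted string literals so 'where' inside them is ignored.
    stripped = re.sub(r"'(?:[^']|'')*'", "''", sql)
    # Drop line comments introduced by "--".
    stripped = re.sub(r"--[^\n]*", "", stripped)
    return re.search(r"\bwhere\b", stripped, re.IGNORECASE) is not None


if __name__ == "__main__":
    print(has_where_clause("SELECT name FROM customers WHERE id = 42"))
    print(has_where_clause("SELECT name FROM customers"))
```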

Review Oracle Bulkload Origin Pipelines

Starting with version 5.8.0, pipelines using the Oracle Bulkload origin no longer fail when the origin encounters an empty table. This change might cause Oracle Bulkload pipelines created with earlier versions of Data Collector to behave in unexpected ways.

After you upgrade to version 5.8.0 or later, review any pipelines that use the Oracle Bulkload origin to ensure they behave as expected.

Update Stages that Were Using Enterprise Stage Libraries

Starting with version 5.8.0, Data Collector no longer supports Enterprise stage libraries.

After you upgrade to 5.8.0 or later, update stages using any of the following Enterprise stage libraries by installing the stage library as a custom stage library:

Protector
Microsoft SQL Server 2019 Big Data Cluster

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.7.0, the Oracle CDC origin must use a user account with access to the all_tab_cols view.

After you upgrade to version 5.7.0 or later, run the following command in Oracle to grant the user account access to the view:

grant select on all_tab_cols to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.

Review Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 Origin Pipelines

Starting with version 5.7.0, Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins have a File Processing Delay property that allows you to configure the minimum number of milliseconds that must pass from the time a file is created before it is processed. In upgraded pipelines these origins receive the default file processing delay of 10,000 milliseconds.

After you upgrade to version 5.7.0 or later, review pipelines that include the Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins. If the 10,000 millisecond delay is not appropriate, update the pipelines accordingly.
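
The effect of the delay can be pictured as a comparison between a file's age and the configured value. This sketch illustrates the rule as described above and is not the origins' implementation. Treating a file whose age exactly equals the delay as ready is an assumption, and the names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_DELAY_MS = 10_000  # default given to upgraded pipelines


def is_ready(created_at, now, delay_ms=DEFAULT_DELAY_MS):
    """Report whether at least delay_ms milliseconds have passed since creation."""
    return now - created_at >= timedelta(milliseconds=delay_ms)


if __name__ == "__main__":
    created = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(is_ready(created, created + timedelta(seconds=9)))   # still too new
    print(is_ready(created, created + timedelta(seconds=10)))  # old enough
```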

Review the Batch Wait Time for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone Origins

Versions of Data Collector prior to 5.7.0 incorrectly treated the batch wait time value configured for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone origins as milliseconds instead of seconds. Starting with version 5.7.0, Data Collector treats the batch wait time value as seconds, which can increase the wait time for empty batches in upgraded pipelines.

After upgrading to version 5.7.0 or later, review the batch wait time for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone origins in upgraded pipelines, and update the value if necessary.
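
The size of the change is easiest to see with numbers. The following minimal sketch compares the wait that a configured value produced before the upgrade with the wait it produces afterward, and suggests a value that keeps the earlier wait. Rounding up to a whole number of at least 1 is an assumption, and the function names are hypothetical.

```python
import math


def wait_before_and_after(configured_value):
    """Return (old_wait_seconds, new_wait_seconds) for a configured value."""
    old_wait_s = configured_value / 1000  # earlier releases read milliseconds
    new_wait_s = float(configured_value)  # 5.7.0 and later read seconds
    return old_wait_s, new_wait_s


def value_preserving_old_wait(configured_value):
    """Whole number of seconds closest to the wait used before the upgrade."""
    # Assumption: the property takes whole seconds, so round up to at least 1.
    return max(1, math.ceil(configured_value / 1000))


if __name__ == "__main__":
    # A value of 5000 meant 5 seconds before the upgrade and 5000 seconds after.
    print(wait_before_and_after(5000))
    print(value_preserving_old_wait(5000))
```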

Review Amazon S3 and Databricks Delta Lake Stages

Starting with version 5.6.0, you can no longer include the forward slash (/) in the following properties due to an Amazon Web Services (AWS) SDK upgrade:

Bucket property for the Amazon S3 origin
Bucket and path property for the Amazon S3 destination and executor
Bucket property for the Databricks Delta Lake destination when staging files to Amazon S3

For more information about this change, see the aws-sdk-java list of Amazon S3 bug fixes.

As a result, you can define only the bucket name in these bucket properties. Use the following properties for each stage to define the path to an object inside the bucket:

Amazon S3 origin - Common Prefix and Prefix Pattern properties
Amazon S3 destination - Common Prefix and Partition Prefix properties
Amazon S3 executor - Object property on the Tasks tab
Databricks Delta Lake destination - Stage File Prefix property on the Staging tab

After you upgrade to version 5.6.0 or later, review the bucket property in these stages to ensure that the property defines the bucket name only. Modify the properties as needed to define only the bucket name in the bucket property and to define the path in the remaining properties.

For example, if an Amazon S3 origin configured in an earlier Data Collector version defines the properties as follows:

Bucket: orders/US/West
Common Prefix:
Prefix Pattern: **/*.log

Update the properties as follows:

Bucket: orders
Common Prefix: US/West/
Prefix Pattern: **/*.log
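
The same split can be scripted when many stages need review. The sketch below moves any path found in an earlier Bucket value to the front of the Common Prefix, following the example above. The function name is hypothetical, and ending the prefix with a slash simply mirrors the example.

```python
def split_bucket(bucket, common_prefix=""):
    """Split a legacy Bucket value into (bucket name, common prefix)."""
    # Everything before the first slash is the bucket name.
    name, _, path = bucket.strip("/").partition("/")
    # Put the path from the bucket value ahead of any existing common prefix.
    parts = [p for p in (path.strip("/"), common_prefix.strip("/")) if p]
    new_prefix = "/".join(parts) + "/" if parts else ""
    return name, new_prefix


if __name__ == "__main__":
    print(split_bucket("orders/US/West"))
```

The Prefix Pattern property is left unchanged.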

Install the Databricks Stage Library

Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the Databricks stage library, streamsets-datacollector-sdc-databricks-lib, to enable pipelines and jobs that use these Databricks stages or the Databricks connection to run as expected.

Review Databricks Stages

Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.

After you upgrade to version 5.6.0 or later, review the JDBC URL property in the Databricks Delta Lake destination and the JDBC Connection String property in the Databricks Query executor to ensure that the scheme resolves to jdbc:databricks.

Note: The upgrade process does not update runtime parameters. You must manually change runtime parameters that define the URL or connection string.
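
For values that must be changed by hand, such as runtime parameters, the rewrite is a prefix swap. This minimal sketch replaces a leading jdbc:spark scheme and leaves other values untouched. The function name and the host in the example are hypothetical.

```python
OLD_SCHEME = "jdbc:spark:"
NEW_SCHEME = "jdbc:databricks:"


def to_databricks_scheme(url):
    """Replace a leading jdbc:spark scheme with jdbc:databricks."""
    if url.lower().startswith(OLD_SCHEME):
        return NEW_SCHEME + url[len(OLD_SCHEME):]
    return url  # already updated, or not a Databricks JDBC URL


if __name__ == "__main__":
    # Placeholder host, for illustration only.
    print(to_databricks_scheme("jdbc:spark://example.cloud.databricks.com:443/default"))
```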

Update the Databricks Delta Lake Connection

Starting with version 5.6.0, the scheme of the URL is jdbc:databricks rather than jdbc:spark.

After you update a connection to use a version 5.6.0 or later authoring Data Collector, edit the JDBC URL property to use the jdbc:databricks scheme.

Review Scripts in Jython Stages

Starting with version 5.6.0, Jython stages use Jython 2.7.3 to process data.

After you upgrade to version 5.6.0 or later, review the scripts used in the Jython Scripting origin and the Jython Evaluator processor to ensure that they process data as expected.

Install the JDBC Oracle Stage Library

Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.

After you upgrade to version 5.6.0 or later, install the JDBC Oracle stage library, streamsets-datacollector-jdbc-branded-oracle-lib, to enable pipelines and jobs that use the Oracle Bulkload origin to run as expected.

Grant Users View Access for the Oracle CDC Origin

Starting with version 5.6.0, the Oracle CDC origin requires that the configured database user has access to the v$containers view.

After you upgrade to version 5.6.0 or later, run the following command in Oracle to grant the user account access to the view:

grant select on v$containers to <user name>;

For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.

Update Origins and Processors that Read Compressed Files

Starting with version 5.6.0, origins that read compressed files require you to set the Compression Library property to properly read files compressed with the Airlift version of Snappy. Destinations compress files with the Airlift version of Snappy. This affects the HTTP Client processor and the following origins:

Amazon S3
Azure Blob Storage
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2 (Legacy)
Azure IoT/Event Hub Consumer
CoAP Server
Directory
File Tail
Hadoop FS Standalone
Google Cloud Storage
Google Pub/Sub Subscriber
gRPC Client
HTTP Client
HTTP Server
Kafka Multitopic Consumer
MQTT Subscriber
REST Service
SFTP/FTP/FTPS Client
TCP Server
WebSocket Client
WebSocket Server

After you upgrade to version 5.6.0 or later, review your pipelines. In any origins and processors that read files compressed using the Airlift version of Snappy, including files produced by destinations, set the Compression Library property to Snappy (Airlift Snappy).

Install the Azure Stage Library

Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.

After you upgrade to version 5.5.0 or later, install the Azure stage library, streamsets-datacollector-azure-lib, so that pipelines and jobs that use the Azure Synapse SQL destination or connection run as expected.

Review Salesforce Pipelines

Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.

After you upgrade to version 5.5.0 or later, review pipelines with Salesforce stages and ensure that they do not expect dates to be imported as strings.

Review OPC UA Client Pipelines

Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.

After you upgrade to version 5.5.0 or later, review OPC UA Client pipelines to ensure that the configuration for the Max Message Size property is appropriate for the pipeline. The default maximum message size is 2097152.

Install the Snowflake Stage Library to Use Snowflake

Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.

After you upgrade to 5.4.0 or later, install the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib, to enable pipelines and jobs that use Snowflake stages or connections to run as expected.

Install the Google Cloud Stage Library to Use BigQuery

Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.

After you upgrade to version 5.3.0 or later, install the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, to enable pipelines and jobs using BigQuery stages or connections to run as expected.

Review JDBC Multitable Consumer Pipelines

Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.

Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property.

After you upgrade to version 5.3.0 or later, review JDBC Multitable Consumer origin pipelines to ensure that the new value for the Minimum Idle Connections property is appropriate for each pipeline.

Review Missing Field Behavior for Field Replacer Processors

Starting with version 5.3.0, the advanced Field Does Not Exist property in the Field Replacer processor has the following two new options that replace the Include without Processing option:

Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.

After you upgrade to version 5.3.0 or later, the Field Does Not Exist property is set to Add New Field. Review Field Replacer pipelines to ensure that this behavior is appropriate.

Review runtime:loadResource Pipelines

Starting with version 5.3.0, pipelines that include the runtime:loadResource function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.

After you upgrade to version 5.3.0 or later, review pipelines that use the runtime:loadResource function and ensure that the function calls resource files that include the required information.

Manage Underscores in Snowflake Connection Information

Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens by default. This can adversely affect communication with Snowflake when Snowflake connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.

After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.

When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
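
A quick way to review connection information is to check whether the host portion of each URL contains an underscore. The following is a minimal sketch that uses only the Python standard library. The function name and the account names in the example are hypothetical.

```python
from urllib.parse import urlsplit


def host_has_underscore(url):
    """Report whether the host portion of a URL contains an underscore."""
    # JDBC URLs begin with "jdbc:", which urlsplit would read as the scheme.
    if url.lower().startswith("jdbc:"):
        url = url[len("jdbc:"):]
    host = urlsplit(url).hostname or ""
    return "_" in host


if __name__ == "__main__":
    # Placeholder account names, for illustration only.
    print(host_has_underscore("jdbc:snowflake://my_org-my_account.snowflakecomputing.com/?db=SALES_DB"))
    print(host_has_underscore("jdbc:snowflake://myorg-myaccount.snowflakecomputing.com/?db=SALES_DB"))
```

Underscores elsewhere in the URL, such as in a database name, are not reported, because the allowUnderscoresInHost property concerns the host.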

Review MySQL Binary Log Pipelines

Starting with version 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.

In previous releases, when reading from a database where the binlog_row_metadata MySQL database property was set to MINIMAL, Enum fields were converted to Long, and Set fields were converted to Integer.

In version 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.

After you upgrade to version 5.2.0, review MySQL Binary Log pipelines that process Enum and Set data from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.

Review Blob and Clob Processing in Oracle CDC Client Pipelines

Starting with version 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.

In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values.

Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.

Review Oracle CDC Client pipelines to assess how they should handle Blob and Clob columns:

To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size.
Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.
If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases.
If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.
In a future release, this behavior will change so the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.

Review Error Handling for Snowflake CDC Pipelines

In previous releases of the Snowflake Enterprise stage library, when the Snowflake destination runs a MERGE query that fails to write all CDC data in a batch to Snowflake, the Snowflake destination generates a stage error indicating that there was a difference between the number of records expected to be written and the number of records actually written to Snowflake.

The destination does not provide additional detail because Snowflake does not provide information about the individual records that failed to be written when a query fails.

Starting with version 1.12.0 of the Snowflake Enterprise stage library, when a query that writes CDC data fails, in addition to generating the stage error, the Snowflake destination passes all records in the batch to error handling. As a result, the error records are handled based on the error handling configured for the stage and pipeline.

Review stage and pipeline error handling for Snowflake CDC pipelines to ensure that error records are handled appropriately.

Note: The error records passed to error handling have been processed by the Snowflake destination. For example, if the batch includes three records that update the same row, they are merged into a single update record.

Review SQL Server Pipelines with Unencrypted Connections

Starting with version 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version introduced a backward-incompatible change.

As a result, after you upgrade to 5.1.0 or later, upgraded pipelines that connect to Microsoft SQL Server without SSL/TLS encryption will likely fail with a message such as the following:

The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.

This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.

Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
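
For the connection string approach, the change amounts to appending one property. This minimal sketch adds encrypt=false only when the string does not already set encrypt. It assumes the semicolon-separated format of SQL Server JDBC connection strings and does not handle values escaped with braces. The function name, host, and database are hypothetical.

```python
def with_encrypt_false(conn_str):
    """Append encrypt=false unless the connection string already sets encrypt."""
    # Properties follow the server portion, separated by semicolons.
    for prop in conn_str.split(";")[1:]:
        key = prop.partition("=")[0].strip().lower()
        if key == "encrypt":
            return conn_str  # leave an explicit setting untouched
    return conn_str.rstrip(";") + ";encrypt=false"


if __name__ == "__main__":
    # Placeholder host and database, for illustration only.
    print(with_encrypt_false("jdbc:sqlserver://dbhost:1433;databaseName=sales"))
```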

To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.

Review Dockerfiles for Custom Docker Images

Starting with version 5.1.0, the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. In previous releases, the Data Collector Docker image used Alpine Linux as a parent image.

If you build custom Data Collector images using streamsets/datacollector version 5.0.0 or earlier as the parent image, review your Dockerfiles and make any updates required for compatibility with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0 or later versions.

Review Oracle CDC Client Local Buffer Pipelines

Starting with version 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.

After you upgrade to Data Collector 5.1.0 or later, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.

This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short term troubleshooting.

Update Oracle CDC Client Origin User Accounts

Starting with version 5.0.0, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery.

After you upgrade to version 5.0.0 or later, use the following GRANT statements to update the Oracle user account associated with the origin:

GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>;
GRANT select on V_$INSTANCE to <user name>;

Review Couchbase Pipelines

Starting with version 4.4.0, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.

However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:

https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar

To install an external library, see Install External Libraries.

Update Keystore Location

Starting with version 4.2.0, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties.

In previous releases, you can store the keystore file in the Data Collector configuration directory, <installation_dir>/etc, and then define the location to the file using a path relative to that directory. You can continue to store the file in the configuration directory, but it is best practice to move it to the resources directory when you upgrade.

Review Tableau CRM Pipelines

Starting with version 4.2.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, writes to Salesforce differently from versions 3.7.0 - 4.1.x. When upgrading from version 3.7.0 - 4.1.x, review Tableau CRM pipelines to ensure that the destination behaves appropriately. When upgrading from a version prior to 3.7.0, no action is needed.

With version 4.2.0 and later, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval when no new data arrives. You configure the interval with the Dataset Wait Time stage property.

In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.

After upgrading from version 3.7.0 - 4.1.x to version 4.2.0 or later, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to indicate the interval that Salesforce should wait before processing each dataset.