Post Upgrade Tasks
After you upgrade Data Collector, complete the following tasks, as needed.
Install the Azure stage library
Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.
After you upgrade to version 5.5.0, install the Azure stage library, streamsets-datacollector-azure-lib, so that pipelines and jobs that use the Azure Synapse SQL destination or connection run as expected.
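For a tarball installation, you can typically install a stage library with the stagelibs command line utility, or through Package Manager in the Data Collector UI. A sketch using the command line utility, run from the Data Collector installation directory:

```shell
# Download and install the Azure stage library
bin/streamsets stagelibs -install=streamsets-datacollector-azure-lib
```

Restart Data Collector afterward so that the newly installed library is loaded.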
Review Salesforce pipelines
Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.
After you upgrade to version 5.5.0, review pipelines with Salesforce stages and ensure that they do not expect dates to be imported as strings.
Review OPC UA Client Pipelines
Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.
After you upgrade to version 5.5.0 or later, review OPC UA Client pipelines to ensure that the configuration for the Max Message Size property is appropriate for the pipeline. The default maximum message size is 2097152 bytes (2 MB).
Install the Snowflake Stage Library to Use Snowflake
Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.
After you upgrade to version 5.4.0 or later, install the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib, to enable pipelines and jobs that use Snowflake stages or connections to run as expected.
Install the Google Cloud Stage Library to Use BigQuery
Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.
After you upgrade to version 5.3.0 or later, install the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, to enable pipelines and jobs that use BigQuery stages or connections to run as expected.
Review JDBC Multitable Consumer Pipelines
Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.
Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property.
After you upgrade to version 5.3.0 or later, review JDBC Multitable Consumer origin pipelines to ensure that the new value for the Minimum Idle Connections property is appropriate for each pipeline.
Review Missing Field Behavior for Field Replacer Processors
Starting with version 5.3.0, the Field Replacer processor includes the Field Does Not Exist property, which determines how the processor handles fields defined on the Replace tab that do not exist in a record:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
After you upgrade to version 5.3.0 or later, the Field Does Not Exist property is set to Add New Field. Review Field Replacer pipelines to ensure that this behavior is appropriate.
Review runtime:loadResource Pipelines
Starting with version 5.3.0, pipelines that include the runtime:loadResource function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.
After you upgrade to version 5.3.0 or later, review pipelines that use the runtime:loadResource function and ensure that the function calls resource files that include the required information.
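For example, a stage property might load a value from a resource file with an expression like the following, where props.txt is a hypothetical file in the Data Collector resources directory and the second argument indicates whether the file must have restricted permissions. After the upgrade, this expression fails if props.txt is missing or empty:

```
${runtime:loadResource('props.txt', true)}
```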
Review MySQL Binary Log Pipelines
Starting with version 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields. In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields were converted to Long fields, and Set fields were converted to Integer fields. In version 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.
After you upgrade to version 5.2.0 or later, review pipelines that include the MySQL Binary Log origin reading from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
Review Blob and Clob Processing in Oracle CDC Client Pipelines
Starting with version 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.
In previous releases, the origin did not process Blob or Clob columns. However, when the Unsupported Fields to Records property was enabled, the origin included Blob and Clob field names and raw string values.
Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.
After you upgrade to version 5.2.0 or later, review Oracle CDC Client pipelines that buffer changes locally:
- To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size. Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.
- If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases.
If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.
In a future release, this behavior will change so that the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.
Review Error Handling for Snowflake CDC Pipelines
In previous releases of the Snowflake Enterprise stage library, when the Snowflake destination runs a MERGE query that fails to write all CDC data in a batch to Snowflake, the Snowflake destination generates a stage error indicating that there was a difference between the number of records expected to be written and the number of records actually written to Snowflake.
The destination does not provide additional detail because Snowflake does not provide information about the individual records that failed to be written when a query fails.
Starting with version 1.12.0 of the Snowflake Enterprise stage library, when a query that writes CDC data fails, in addition to generating the stage error, the Snowflake destination passes all records in the batch to error handling. As a result, the error records are handled based on the error handling configured for the stage and pipeline.
Review stage and pipeline error handling for Snowflake CDC pipelines to ensure that error records are handled appropriately.
Review SQL Server Pipelines with Unencrypted Connections
Starting with version 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version has introduced a breaking backward-incompatible change.
With this driver version, connections are encrypted by default, so upgraded pipelines that connect to SQL Server without SSL/TLS encryption can fail with the following error:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.
This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.
Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.
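The two workarounds can be sketched as follows. The host, port, and database names below are placeholders:

```
# Pipeline level: disable encryption in the JDBC connection string
jdbc:sqlserver://sqlserver.example.com:1433;databaseName=mydb;encrypt=false

# Data Collector level: add to sdc.properties as a stopgap for all JDBC stages
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL=true
```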
Review Dockerfiles for Custom Docker Images
Starting with version 5.1.0, the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. In previous releases, the Data Collector Docker image used Alpine Linux as a parent image.
If you build custom Data Collector images using streamsets/datacollector version 5.0.0 or earlier as the parent image, review your Dockerfiles and make all required updates to ensure compatibility with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0 or later versions.
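The most common update is package installation: a Dockerfile step that used Alpine's apk must use apt-get against the Ubuntu-based image instead. A minimal sketch, assuming a hypothetical custom image that adds curl (the non-root user name, sdc here, may differ in your version of the image):

```dockerfile
# Parent image is now based on Ubuntu 20.04 LTS (Focal Fossa)
FROM streamsets/datacollector:5.1.0

# Install OS packages with apt-get (Ubuntu) instead of apk (Alpine)
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl && \
    rm -rf /var/lib/apt/lists/*

# Return to the default non-root Data Collector user
USER sdc
```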
Review Oracle CDC Client Local Buffer Pipelines
Starting with version 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.
After you upgrade to Data Collector 5.1.0 or later, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.
This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short-term troubleshooting.
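To re-enable the reporting for troubleshooting, the configuration line in sdc.properties looks like this:

```
# Re-enable memory consumption reporting for Oracle CDC Client local buffers.
# Affects all Oracle CDC Client pipelines that use local buffering; remove
# the line after troubleshooting to restore default performance.
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize=true
```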
Update Oracle CDC Client Origin User Accounts
Starting with version 5.0.0, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery. Have your Oracle database administrator grant the following permissions to the user account configured in the origin:
GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>;
GRANT select on V_$INSTANCE to <user name>;
Review Couchbase Pipelines
Starting with version 4.4.0, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:
https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar
To install an external library, see Install External Libraries.
Update Keystore Location
Starting with version 4.2.0, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties.
In previous releases, you could store the keystore file in the Data Collector configuration directory, <installation_dir>/etc, and then define the location of the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.
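For example, after moving the keystore file to the resources directory, the keystore location in the Data Collector configuration properties can be defined with a path relative to that directory. The property names below reflect a typical sdc.properties file; check the file in your installation for the exact names used by your version:

```
# Keystore stored in <installation_dir>/externalResources/resources;
# the relative path is resolved against the resources directory
https.keystore.path=keystore.jks
https.keystore.password=${file("keystore-password.txt")}
```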
Review Tableau CRM Pipelines
Starting with version 4.2.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, writes to Salesforce differently from versions 3.7.0 - 4.1.x. When upgrading from version 3.7.0 - 4.1.x, review Tableau CRM pipelines to ensure that the destination behaves appropriately. When upgrading from a version prior to 3.7.0, no action is needed.
With version 4.2.0 and later, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval when no new data arrives. You configure the interval with the Dataset Wait Time stage property.
In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.
After upgrading from version 3.7.0 - 4.1.x to version 4.2.0 or later, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to indicate the interval that Salesforce should wait before processing each dataset.