Post Upgrade Tasks
After you upgrade Data Collector, complete the following tasks, as needed.
Review Dockerfiles for Custom Docker Images
In previous releases, the Data Collector Docker image used Ubuntu as a parent image. Starting with version 6.0.0, the Data Collector Docker image uses Red Hat Enterprise Linux as a parent image.
If you build custom Data Collector images using streamsets/datacollector as the parent image, review your Dockerfiles and make all required updates to become compatible with Red Hat Enterprise Linux before you build a custom Data Collector image.
Removed Antenna Doctor
In previous versions of Data Collector, you could configure Antenna Doctor to suggest potential fixes and workarounds to common pipeline issues. Starting with version 6.0, Antenna Doctor is not included in Data Collector.
After upgrading to version 6.0 or later, Antenna Doctor does not send pipeline messages.
Review Azure Blob Storage Origins
In previous versions of Data Collector, the default spooling period for Azure Blob Storage origins was 5 seconds. Starting with version 6.0, the default spooling period is 30 seconds, and upgraded origins are given the new default value of 30 seconds.
After upgrading to version 6.0 or later, review upgraded Azure Blob Storage origins to ensure they are configured with an appropriate spooling period.
Review HashiCorp Vault Credential Stores
In previous versions of Data Collector, the HashiCorp Vault credential store authMethod property could be empty. Starting with version 6.0, the authentication method must be set to one of the following values:
- appId
- appRole
- azure
After upgrading to version 6.0 or later, verify that the authentication method for upgraded credential stores is set to a valid value.
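As an illustration, the authentication method is set in the credential store configuration. The property prefix below is an assumption based on a credential store with the ID vault; check your Data Collector configuration for the exact prefix:

```properties
# Illustrative fragment: set the Vault authentication method to one of
# the valid values (appId, appRole, or azure). The "credentialStore.vault"
# prefix is an assumption, not a confirmed property name.
credentialStore.vault.config.authMethod=appRole
```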
Install the CONNX JDBC driver for upgraded CONNX and CONNX CDC origins
Starting with version 5.12.0, the CONNX JDBC driver is no longer included in the streamsets-datacollector-connx-lib stage library. You must manually install the driver into the stage library before using a CONNX or CONNX CDC origin.
After upgrading to version 5.12.0 or later, if you have upgraded pipelines using a CONNX or CONNX CDC origin, install the CONNX JDBC driver as an external library for the streamsets-datacollector-connx-lib stage library.
Install the Oracle JDBC driver for upgraded Oracle Multitable Consumer origins and Oracle destinations
Starting with version 5.12.0, the Oracle JDBC driver is no longer included in the streamsets-datacollector-jdbc-branded-oracle-lib stage library. You must manually install the driver into the stage library before using an Oracle Multitable Consumer origin or Oracle destination.
After upgrading to version 5.12.0 or later, if you have upgraded pipelines using an Oracle Multitable Consumer origin or Oracle destination, install the Oracle JDBC driver as an external library for the streamsets-datacollector-jdbc-branded-oracle-lib stage library.
Review Pipelines with Google BigQuery or Snowflake Destinations Writing JSON Data
Starting with version 5.11.0, you cannot configure characters to represent null values or newline characters for Google BigQuery or Snowflake destinations when writing JSON data. Upgraded destinations do not change null values or newline characters.
After upgrading to version 5.11.0 or later, review pipelines using Google BigQuery or Snowflake destinations writing JSON data to ensure the destination does not receive any null values or newline characters from the pipeline that should not be passed to the external system.
Review Snowflake File Uploader Staging Details
Starting with version 5.11.0, the Database and Schema properties have been removed from the Snowflake File Uploader destination. In Data Collector 5.10.0, if the Stage Database and Stage Schema properties were not configured, the destination used the database and schema values configured for the destination or Control Hub connection instead.
When upgrading from version 5.10.0 to 5.11.0 or later, Snowflake File Uploader destinations that do not use a Control Hub connection and do not have values configured for the Stage Database or Stage Schema properties are assigned a staging database value equal to the configured database value and a staging schema value equal to the configured schema value. Snowflake File Uploader destinations that use a Control Hub connection and do not have values configured for the Stage Database or Stage Schema properties are not assigned any values for these properties and must have them configured after upgrading.
After upgrading to version 5.11.0 or later from version 5.10.0, review Snowflake File Uploader destinations that use Control Hub connections and make sure the Stage Database and Stage Schema properties are configured.
Review Pipeline Notification Email Configurations
Starting with version 5.11.0, the format of pipeline notification emails has changed to include a new error code format.
After upgrading to version 5.11.0 or later, review notification email configurations for upgraded pipelines to ensure they behave as expected.
Review the Batch Wait Time for Directory Origins
Starting with version 5.11.0, the Directory origin correctly interprets the batch wait time value as seconds. In earlier releases, the origin incorrectly interpreted the value as milliseconds.
After upgrading to version 5.11.0 or later, review Directory origins to ensure they are configured with an appropriate batch wait time.
Review the Oracle CDC Client Record Cache Size
Starting with version 5.10.0, you can configure the maximum size of the record cache for an Oracle CDC Client origin using the Records Cache Size property. Upgraded pipelines are given the default value of -2, which represents two times the batch size.
After upgrading to Data Collector 5.10.0 or later, verify that Oracle CDC Client origins are configured with the appropriate record cache size.
Review Search Mode Behavior for Start Jobs Pipelines
Starting with version 5.10.0, Start Jobs stages have updated search mode options. Pipelines upgraded from version 5.2.x or earlier that were configured with the contain search mode option are updated to use the new contains unique search mode option.
After upgrading to Data Collector 5.10.0 or later from Data Collector 5.2.x or earlier, verify that Start Jobs pipelines are using the appropriate search mode.
Review the Maximum Batch Vault Size for Oracle CDC Origin Pipelines
Starting with version 5.9.0, the Oracle CDC origin has a Max Batch Vault Size property that allows you to configure the maximum number of batches the origin pre-generates while the pipeline is processing other batches. In upgraded pipelines, the origin uses the default maximum batch vault size of 64.
After you upgrade to version 5.9.0 or later, review Oracle CDC origin pipelines. If the maximum batch vault size is not appropriate, update the pipelines accordingly.
Review Amazon, Azure, Data Parser, JMS Consumer, and Pulsar Consumer Origin Pipelines
Starting with version 5.9.0, the behavior of the following origins changes when they read tables that contain duplicate column names:
- Amazon S3
- Amazon SQS Consumer
- Azure Blob Storage
- Azure Data Lake Storage Gen2
- Data Parser
- JMS Consumer
- Pulsar Consumer
- Pulsar Consumer (Legacy)
When configured to read tables that contain duplicate column names, these origins treat the tables as invalid and generate an error.
After you upgrade to version 5.9.0 or later, review pipelines that use these origins. If any pipelines require the ability to read tables containing multiple columns with the same name, configure the origins to ignore column headers.
Review JDBC Lookup Processor SQL Query Configuration
Starting with version 5.9.0, stability and performance improvements to the JDBC Lookup processor cause the processor to strictly enforce the requirement of a WHERE clause in SQL queries.
After upgrading to version 5.9.0 or later, verify that the SQL Query property for each JDBC Lookup processor is configured with a WHERE clause.
Review Oracle Bulkload Origin Pipelines
Starting with version 5.8.0, pipelines using the Oracle Bulkload origin no longer fail when the origin encounters an empty table. This change might cause Oracle Bulkload pipelines created with earlier versions of Data Collector to behave in unexpected ways.
After you upgrade to version 5.8.0 or later, review any pipelines that use the Oracle Bulkload origin to ensure they behave as expected.
Update stages that were using Enterprise stage libraries
Starting with version 5.8.0, Data Collector no longer supports Enterprise stage libraries. Update stages that used the following Enterprise stage libraries:
- Protector
- Microsoft SQL Server 2019 Big Data Cluster
Grant Users View Access for the Oracle CDC Origin
Starting with version 5.7.0, the Oracle CDC origin must use a user account with access to the all_tab_cols view. To grant access, run the following command:
grant select on all_tab_cols to <user name>;
For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.
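For a CDB database, the two grants can be sketched as follows. The pluggable database name sales_pdb is a placeholder:

```sql
-- Run from the root container first.
ALTER SESSION SET CONTAINER = cdb$root;
GRANT SELECT ON all_tab_cols TO <user name>;

-- Then switch to the pluggable database and repeat the grant.
-- The PDB name below is a placeholder.
ALTER SESSION SET CONTAINER = sales_pdb;
GRANT SELECT ON all_tab_cols TO <user name>;
```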
Review Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 Origin Pipelines
Starting with version 5.7.0, Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins have a File Processing Delay property that allows you to configure the minimum number of milliseconds that must pass from the time a file is created before it is processed. In upgraded pipelines these origins receive the default file processing delay of 10,000 milliseconds.
After you upgrade to version 5.7.0 or later, review pipelines that include the Amazon S3, Azure Blob Storage, and Azure Data Lake Storage Gen2 origins. If the 10,000 millisecond delay is not appropriate, update the pipelines accordingly.
Review the Batch Wait Time for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone Origins
Versions of Data Collector prior to 5.7.0 incorrectly treated the batch wait time value configured for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone origins as milliseconds instead of seconds. Starting with version 5.7.0, Data Collector treats the batch wait time value as seconds, which can increase the wait time for empty batches in upgraded pipelines.
After upgrading to version 5.7.0 or later, review the batch wait time for ADLS Gen2 (Legacy), Directory, and Hadoop FS Standalone origins in upgraded pipelines, and update the value if necessary.
Review Amazon S3 and Databricks Delta Lake Stages
Starting with version 5.6.0, the following properties must define the bucket name only, with no additional path information:
- Bucket property for the Amazon S3 origin
- Bucket and path property for the Amazon S3 destination and executor
- Bucket property for the Databricks Delta Lake destination when staging files to Amazon S3
For more information about this change, see the aws-sdk-java list of Amazon S3 bug fixes.
Define any path information in the following properties instead:
- Amazon S3 origin - Common Prefix and Prefix Pattern properties
- Amazon S3 destination - Common Prefix and Partition Prefix properties
- Amazon S3 executor - Object property on the Tasks tab
- Databricks Delta Lake destination - Stage File Prefix property on the Staging tab
After you upgrade to version 5.6.0 or later, review the bucket property in these stages to ensure that the property defines the bucket name only. Modify the properties as needed to define only the bucket name in the bucket property and to define the path in the remaining properties.
For example, say an Amazon S3 origin includes path information in the Bucket property, as follows:
- Bucket: orders/US/West
- Common Prefix:
- Prefix Pattern: **/*.log
After upgrading, move the path information out of the Bucket property:
- Bucket: orders
- Common Prefix: US/West/
- Prefix Pattern: **/*.log
Install the Databricks Stage Library
Starting with version 5.6.0, the Databricks Delta Lake destination, Databricks Query executor, and Databricks Delta Lake connection require the Databricks stage library. In previous releases, they required the Databricks Enterprise stage library.
After you upgrade to version 5.6.0 or later, install the Databricks stage library, streamsets-datacollector-sdc-databricks-lib, to enable pipelines and jobs that use these Databricks stages or the Databricks connection to run as expected.
Review Databricks Stages
Starting with version 5.6.0, the scheme of the URL or connection string for the Databricks Delta Lake destination and Databricks Query executor is jdbc:databricks rather than jdbc:spark.
After you upgrade to version 5.6.0 or later, edit the JDBC URL or connection string for these stages to use the jdbc:databricks scheme.
Update the Databricks Delta Lake Connection
Starting with version 5.6.0, the scheme of the URL is jdbc:databricks rather than jdbc:spark.
After you update a connection to use a version 5.6.0 or later authoring Data Collector, edit the JDBC URL property to use the jdbc:databricks scheme.
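As a sketch, an upgraded JDBC URL changes only in its scheme; the host and path below are placeholders:

```text
Before: jdbc:spark://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default
After:  jdbc:databricks://dbc-a1b2c3d4-e5f6.cloud.databricks.com:443/default
```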
Review Scripts in Jython Stages
Starting with version 5.6.0, Jython stages use Jython 2.7.3 to process data.
After you upgrade to version 5.6.0 or later, review the scripts used in the Jython Scripting origin and the Jython Evaluator processor to ensure that they process data as expected.
Install the JDBC Oracle Stage Library
Starting with version 5.6.0, the Oracle Bulkload origin requires the JDBC Oracle stage library. In previous releases, the origin required the Oracle Enterprise stage library.
After you upgrade to version 5.6.0 or later, install the JDBC Oracle stage library, streamsets-datacollector-jdbc-branded-oracle-lib, to enable pipelines and jobs that use the Oracle Bulkload origin to run as expected.
Grant Users View Access for the Oracle CDC Origin
Starting with version 5.6.0, the Oracle CDC origin requires that the configured database user has access to the v$containers view. To grant access, run the following command:
grant select on v$containers to <user name>;
For CDB databases, run the command from the root container, cdb$root. Then run it again from the pluggable database. For non-CDB databases, run the command from the primary database.
Update Origins and Processors that Read Compressed Files
Starting with version 5.6.0, the following origins and processors include a Compression Library property that specifies the library used to read compressed files:
- Amazon S3
- Azure Blob Storage
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2 (Legacy)
- Azure IoT/Event Hub Consumer
- CoAP Server
- Directory
- File Tail
- Hadoop FS Standalone
- Google Cloud Storage
- Google Pub/Sub Subscriber
- gRPC Client
- HTTP Client
- HTTP Server
- Kafka Multitopic Consumer
- MQTT Subscriber
- REST Service
- SFTP/FTP/FTPS Client
- TCP Server
- WebSocket Client
- WebSocket Server
After you upgrade to version 5.6.0 or later, review your pipelines. In any origins and processors that read files compressed using the Airlift version of Snappy, including files produced by destinations, set the Compression Library property to Snappy (Airlift Snappy).
Install the Azure stage library
Starting with version 5.5.0, the Azure Synapse SQL destination and Azure Synapse connection require the installation of the Azure stage library. In previous releases, the destination and connection required the Azure Synapse Enterprise stage library.
After you upgrade to version 5.5.0 or later, install the Azure stage library, streamsets-datacollector-azure-lib, so that pipelines and jobs that use the Azure Synapse SQL destination or connection run as expected.
Review Salesforce pipelines
Starting with version 5.5.0, Salesforce stages correctly import date values as dates rather than as strings.
After you upgrade to version 5.5.0 or later, review pipelines with Salesforce stages and ensure that they do not expect dates to be imported as strings.
Review OPC UA Client Pipelines
Starting with version 5.5.0, the OPC UA Client origin no longer includes the Max Array Length or Max String Length properties. These properties were removed because they are redundant. The existing Max Message Size property properly limits the message size regardless of the data type of the message.
After you upgrade to version 5.5.0 or later, review OPC UA Client pipelines to ensure that the configuration for the Max Message Size property is appropriate for the pipeline. The default maximum message size is 2097152.
Install the Snowflake Stage Library to Use Snowflake
Starting with version 5.4.0, using Snowflake stages and Snowflake connections requires installing the Snowflake stage library. In previous releases, Snowflake stages and connections were available with the Snowflake Enterprise stage library.
After you upgrade to 5.4.0 or later, install the Snowflake stage library, streamsets-datacollector-sdc-snowflake-lib, to enable pipelines and jobs that use Snowflake stages or connections to run as expected.
Install the Google Cloud Stage Library to Use BigQuery
Starting with version 5.3.0, using Google BigQuery stages and Google BigQuery connections requires installing the Google Cloud stage library. In previous releases, BigQuery stages and connections were available with the Google BigQuery Enterprise stage library.
After you upgrade to version 5.3.0 or later, install the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, to enable pipelines and jobs using BigQuery stages or connections to run as expected.
Review JDBC Multitable Consumer Pipelines
Starting with version 5.3.0, the Minimum Idle Connections property in the JDBC Multitable Consumer origin cannot be set higher than the Number of Threads property. In previous releases, there was no limit to the number of minimum idle connections that you could configure.
Upgraded pipelines have the Minimum Idle Connections property set to the same value as the Number of Threads property.
After you upgrade to version 5.3.0 or later, review JDBC Multitable Consumer origin pipelines to ensure that the new value for the Minimum Idle Connections property is appropriate for each pipeline.
Review Missing Field Behavior for Field Replacer Processors
Starting with version 5.3.0, the Field Replacer processor includes a Field Does Not Exist property with the following options for handling fields that do not exist in a record:
- Add New Field - Adds the fields defined on the Replace tab to records if they do not exist.
- Ignore New Field - Ignores any fields defined on the Replace tab if they do not exist.
After you upgrade to version 5.3.0 or later, the Field Does Not Exist property is set to Add New Field. Review Field Replacer pipelines to ensure that this behavior is appropriate.
Review runtime:loadResource Pipelines
Starting with version 5.3.0, pipelines that include the runtime:loadResource function fail with errors when the function calls a missing or empty resource file. In previous releases, those pipelines sometimes continued to run without errors.
After you upgrade to version 5.3.0 or later, review pipelines that use the runtime:loadResource function and ensure that the function calls resource files that include the required information.
Manage Underscores in Snowflake Connection Information
Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens, by default. This can adversely affect communicating with Snowflake when Snowflake connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.
After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.
When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true.
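For example, the property can be passed as an additional connection property for the Snowflake stage; the property name comes from the Snowflake JDBC driver:

```properties
# Snowflake JDBC driver property: keep underscores in the account host
# name instead of converting them to hyphens.
allowUnderscoresInHost=true
```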
For more information and alternate solutions, see this Snowflake community article.
Review MySQL Binary Log Pipelines
Starting with version 5.2.0, the MySQL Binary Log origin converts MySQL Enum and Set fields to String fields.
In previous releases, when reading from a database where the binlog_row_metadata MySQL database property is set to MINIMAL, Enum fields are converted to Long, and Set fields are converted to Integer.
In version 5.2.0 as well as previous releases, when the binlog_row_metadata MySQL database property is set to FULL, Enum and Set fields are converted to String.
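To check which behavior applies to your database, you can inspect the server setting with a standard MySQL statement:

```sql
-- Shows whether binlog row metadata is MINIMAL or FULL
-- (the variable is available in MySQL 8.0 and later).
SHOW VARIABLES LIKE 'binlog_row_metadata';
```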
After you upgrade to version 5.2.0 or later, review pipelines that read from a database with binlog_row_metadata set to MINIMAL. Update the pipeline as needed to ensure that Enum and Set data is processed as expected.
Review Blob and Clob Processing in Oracle CDC Client Pipelines
Starting with version 5.2.0, the Oracle CDC Client origin has new advanced properties that enable processing Blob and Clob columns. You can use these properties when the origin buffers changes locally. They are disabled by default.
In previous releases, the origin does not process Blob or Clob columns. However, when the Unsupported Fields to Records property is enabled, the origin includes Blob and Clob field names and raw string values.
Due to a known issue with this release, when the origin is not configured to process Blob and Clob columns and when the Unsupported Fields to Records property is enabled, the origin continues to include Blob and Clob field names and raw string values. When the property is disabled, the origin includes Blob and Clob field names with null values. The expected behavior is to always include field names with null values unless the origin is configured to process Blob and Clob columns.
- To process Blob and Clob columns, enable Blob and Clob processing on the Advanced tab. You can optionally define a maximum LOB size. Verify that sufficient memory is available to Data Collector before enabling Blob and Clob processing.
- If the origin has the Unsupported Fields to Records property enabled, the origin continues to include Blob and Clob field names and raw string values, as in previous releases. If the origin has the Unsupported Fields to Records property disabled, and if null values are acceptable for Blob and Clob fields, then no action is required at this time.
In a future release, this behavior will change so the Unsupported Fields to Records property has no effect on how Blob and Clob columns are processed.
Review Error Handling for Snowflake CDC Pipelines
In previous releases of the Snowflake Enterprise stage library, when the Snowflake destination runs a MERGE query that fails to write all CDC data in a batch to Snowflake, the Snowflake destination generates a stage error indicating that there was a difference between the number of records expected to be written and the number of records actually written to Snowflake.
The destination does not provide additional detail because Snowflake does not provide information about the individual records that failed to be written when a query fails.
Starting with version 1.12.0 of the Snowflake Enterprise stage library, when a query that writes CDC data fails, in addition to generating the stage error, the Snowflake destination passes all records in the batch to error handling. As a result, the error records are handled based on the error handling configured for the stage and pipeline.
Review stage and pipeline error handling for Snowflake CDC pipelines to ensure that error records are handled appropriately.
Review SQL Server Pipelines with Unencrypted Connections
Starting with version 5.1.0, Data Collector uses Microsoft JDBC Driver for SQL Server version 10.2.1 to connect to Microsoft SQL Server. According to Microsoft, this version introduced a breaking, backward-incompatible change: connections to SQL Server instances that are not configured for encryption can fail with the following error:
The driver could not establish a secure connection to SQL Server by using Secure Sockets Layer (SSL) encryption.
This issue can be resolved by configuring SSL/TLS encryption between Microsoft SQL Server and Data Collector. For details on configuring clients for SSL/TLS encryption, see the Microsoft SQL Server documentation.
Otherwise, you can address this issue at a pipeline level by adding encrypt=false to the connection string, or by adding encrypt as an additional JDBC property and setting it to false.
To avoid having to update all affected pipelines immediately, you can configure Data Collector to attempt to disable SSL/TLS for all pipelines that use a JDBC driver. To do so, set the stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL Data Collector configuration property to true. Note that this property affects all JDBC drivers, and should typically be used only as a stopgap measure. For more information about the configuration property, see Configuring Data Collector.
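The pipeline-level and Data Collector-level options can be sketched as follows; the host and database name are placeholders:

```text
# Connection string with encryption disabled (host and database name
# are placeholders):
jdbc:sqlserver://sqlserver.example.com:1433;databaseName=sales;encrypt=false

# Data Collector configuration property that attempts to disable SSL/TLS
# for all JDBC-based stages (stopgap measure only):
stage.conf_com.streamsets.pipeline.lib.jdbc.disableSSL=true
```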
Review Dockerfiles for Custom Docker Images
Starting with version 5.1.0, the Data Collector Docker image uses Ubuntu 20.04 LTS (Focal Fossa) as a parent image. In previous releases, the Data Collector Docker image used Alpine Linux as a parent image.
If you build custom Data Collector images using streamsets/datacollector version 5.0.0 or earlier as the parent image, review your Dockerfiles and make all required updates to become compatible with Ubuntu Focal Fossa before you build a custom image based on streamsets/datacollector:5.1.0 or later versions.
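As a minimal sketch, a Dockerfile that installed extra packages must switch from the Alpine package manager to the Ubuntu one. The package name and the sdc user name below are assumptions:

```dockerfile
# Before (Alpine-based parent image):
# FROM streamsets/datacollector:5.0.0
# USER root
# RUN apk add --no-cache some-package

# After (Ubuntu Focal Fossa-based parent image):
FROM streamsets/datacollector:5.1.0
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends some-package \
    && rm -rf /var/lib/apt/lists/*
# Switch back to the Data Collector user (user name assumed).
USER sdc
```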
Review Oracle CDC Client Local Buffer Pipelines
Starting with version 5.1.0, pipelines that include the Oracle CDC Client origin no longer report memory consumption data when the origin uses local buffers. In previous releases, this reporting occurred by default, which slowed pipeline performance.
After you upgrade to Data Collector 5.1.0 or later, memory consumption reporting for Oracle CDC Client local buffer usage is no longer performed by default. If you require this information, you can enable it by setting the stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize Data Collector configuration property to true.
This property enables memory consumption data reporting for all Oracle CDC Client pipelines that use local buffering. Because it slows pipeline performance, as a best practice, enable the property only for short term troubleshooting.
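In the Data Collector configuration file, the property is a single line:

```properties
# Re-enables memory consumption reporting for Oracle CDC Client local
# buffers; slows pipelines, so use only for short-term troubleshooting.
stage.conf_com.streamsets.pipeline.stage.origin.jdbc.cdc.oracle.monitorbuffersize=true
```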
Update Oracle CDC Client Origin User Accounts
Starting with version 5.0.0, the Oracle CDC Client origin requires additional Oracle permissions to ensure appropriate handling of self-recovery, failover, and crash recovery. Run the following commands to grant the required permissions:
GRANT select on GV_$ARCHIVED_LOG to <user name>;
GRANT select on GV_$INSTANCE to <user name>;
GRANT select on GV_$LOG to <user name>;
GRANT select on V_$INSTANCE to <user name>;
Review Couchbase Pipelines
Starting with version 4.4.0, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
However, if Couchbase pipelines display errors about classes or methods not being found, you can install the following encryption JAR file as an external library for the Couchbase stage library:
https://search.maven.org/artifact/com.couchbase.client/encryption/1.0.0/jar
To install an external library, see Install External Libraries.
Update Keystore Location
Starting with version 4.2.0, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties.
In previous releases, you can store the keystore file in the Data Collector configuration directory, <installation_dir>/etc, and then define the location to the file using a path relative to that directory. You can continue to store the file in the configuration directory, but it is best practice to move it to the resources directory when you upgrade.
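For example, with the keystore file at <installation_dir>/externalResources/resources/keystore.jks, the configuration can reference it by a relative path. The https.keystore.path property name is an assumption; check your Data Collector configuration for the exact property name:

```properties
# Assumed property name; the value is resolved relative to the
# Data Collector resources directory.
https.keystore.path=keystore.jks
```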
Review Tableau CRM Pipelines
Starting with version 4.2.0, the Tableau CRM destination, previously known as the Einstein Analytics destination, writes to Salesforce differently from versions 3.7.0 - 4.1.x. When upgrading from version 3.7.0 - 4.1.x, review Tableau CRM pipelines to ensure that the destination behaves appropriately. When upgrading from a version prior to 3.7.0, no action is needed.
With version 4.2.0 and later, the destination writes to Salesforce by uploading batches of data to Salesforce, then signaling Salesforce to process the dataset after a configurable interval when no new data arrives. You configure the interval with the Dataset Wait Time stage property.
In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.
After upgrading from version 3.7.0 - 4.1.x to version 4.2.0 or later, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to indicate the interval that Salesforce should wait before processing each dataset.