Post Upgrade Tasks
After you upgrade Transformer, complete the following tasks as needed.
Review Delta Lake Stages Using Transformer-Provided Libraries
Starting with 5.7.0, the Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x.
After upgrading from an earlier Transformer version, review pipelines that include Delta Lake stages that use Transformer-provided libraries to ensure that the Spark upgrade does not adversely affect data processing.
For information about Delta Lake versions and Spark compatibility, see the Databricks documentation.
Review Drivers for Azure SQL Destinations
Starting with 5.6.0, the Azure SQL destination requires a Microsoft JDBC driver for SQL Server version 8 or later. Transformer also includes a Microsoft JDBC driver for SQL Server with the destination.
After upgrading from an earlier Transformer version, if an existing Azure SQL pipeline uses an earlier driver version and is configured to perform a bulk copy, the pipeline may fail with a java.lang.NoSuchMethodError error.
- If the older driver is installed on the cluster, remove the driver or upgrade it to version 8 or later. For more information, see the Microsoft documentation.
- If the older driver is installed as an external library for the Azure SQL destination, remove the existing version of the external library from Transformer to prevent Transformer from uploading it to the cluster. For more information, see Managing External Libraries.
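One way to locate outdated drivers is to scan a directory of JAR files and flag Microsoft JDBC driver files whose major version is below 8. The following is a minimal sketch, assuming the JARs keep Microsoft's usual mssql-jdbc-<version>.jre<N>.jar file names. The function name and the directory you pass in are hypothetical, and a renamed JAR would not be detected.

```python
import re
from pathlib import Path

# Assumption: driver JARs keep Microsoft's default file name pattern,
# for example mssql-jdbc-7.4.1.jre8.jar.
_DRIVER_NAME = re.compile(r"^mssql-jdbc-(\d+)[\w.\-]*\.jar$")


def find_outdated_mssql_drivers(directory, minimum_major=8):
    """Return names of mssql-jdbc JARs whose major version is below minimum_major."""
    outdated = []
    for jar in sorted(Path(directory).glob("*.jar")):
        match = _DRIVER_NAME.match(jar.name)
        if match and int(match.group(1)) < minimum_major:
            outdated.append(jar.name)
    return outdated
```

Any file name that this returns is a candidate for removal or for an upgrade to version 8 or later.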
Review Lookup Processors
Starting with 5.5.0, lookup processors use an internal row_number column to perform lookups and repartition data. This change applies to all lookup processors: Delta Lake Lookup, JDBC Lookup, and Snowflake Lookup.
Due to the new internal column, with this release, a lookup processor configured to sort columns and to return the first matching row can no longer include a row_number column in stage properties. The same lookup processor also now creates partitions based on the specified lookup fields.
For lookup processors configured to sort columns and to use Return the First Matching Row:
- Ensure that the processor does not include a column named row_number in stage properties.
- Verify that downstream pipeline processing is not negatively affected by the partitions created for the fields in the Lookup Field properties.
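The first check can be sketched as a small helper that takes the column names configured in a lookup processor and reports any that collide with the internal column. This is only an illustration: the function name is hypothetical, and it assumes you have already collected the column names from the stage properties yourself.

```python
RESERVED_COLUMN = "row_number"


def conflicts_with_internal_column(column_names):
    """Return the configured column names that collide with the internal row_number column."""
    # Assumption: names are compared case-insensitively, because Spark resolves
    # column names case-insensitively by default.
    return [name for name in column_names if name.strip().lower() == RESERVED_COLUMN]
```

An empty result means that none of the supplied names conflicts with the internal column.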
Review Dockerfiles for Custom Docker Images
In previous releases, the Transformer Docker images used Alpine Linux as a parent image.
Starting with version 5.4.0, the Transformer Docker images use Ubuntu 22.04 LTS (Jammy Jellyfish) as the parent image.
If you build custom Transformer images using a Docker image for Transformer 5.3.x or earlier as the parent image, review your Dockerfiles. Make all required updates so they are compatible with Ubuntu Jammy Jellyfish before you build a custom image based on Docker images for Transformer 5.4.0, streamsets/transformer:scala-2.11_5.4.0 or streamsets/transformer:scala-2.12_5.4.0, or later versions.
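As a starting point for the Dockerfile review, a short script can flag instructions that rely on Alpine-only tooling. The sketch below is a rough aid rather than a complete compatibility check: the pattern list is illustrative and not exhaustive, and the function name is hypothetical.

```python
import re

# Illustrative, non-exhaustive list of Alpine (BusyBox) idioms and what
# typically replaces them on Ubuntu.
ALPINE_HINTS = [
    (r"\bapk\s+(add|del|update|upgrade)\b", "apk is the Alpine package manager; Ubuntu uses apt-get"),
    (r"\badduser\s+-D\b", "BusyBox adduser -D; on Ubuntu consider adduser --disabled-password"),
    (r"\baddgroup\s+-S\b", "BusyBox addgroup -S; on Ubuntu consider addgroup --system"),
]


def find_alpine_specific_lines(dockerfile_text):
    """Return (line_number, line, hint) for each line that matches an Alpine idiom."""
    findings = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip Dockerfile comments
        for pattern, hint in ALPINE_HINTS:
            if re.search(pattern, stripped):
                findings.append((number, stripped, hint))
    return findings
```

Each finding points at a line that likely needs an Ubuntu equivalent before you rebuild on a 5.4.0 or later parent image.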
Manage Underscores in Snowflake Connection Information
Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens by default. This can disrupt communication with Snowflake when the connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.
After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.
When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true.
For more information and alternate solutions, see this Snowflake community article.
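To help with the review, the following sketch checks whether the host portion of a Snowflake JDBC URL contains an underscore and, if so, suggests the driver property named above. The function names and the example URL shape are assumptions for illustration. How you supply the property depends on how your stage or connection is configured.

```python
from urllib.parse import urlsplit


def host_has_underscore(jdbc_url):
    """Return True if the host portion of a Snowflake JDBC URL contains an underscore."""
    # urlsplit does not understand the "jdbc:" prefix, so drop it first.
    plain_url = jdbc_url[len("jdbc:"):] if jdbc_url.startswith("jdbc:") else jdbc_url
    host = urlsplit(plain_url).hostname or ""
    return "_" in host


def suggested_driver_properties(jdbc_url):
    """Return the driver property to set when the host contains an underscore."""
    if host_has_underscore(jdbc_url):
        return {"allowUnderscoresInHost": "true"}
    return {}
```

Only the host is inspected, so underscores in query parameters such as a warehouse name do not trigger the suggestion.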
Update SDK Code for Database-Vendor-Specific JDBC Origins
In version 5.2.0, four origins (MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table) replace the Table property, which accepted a single table, with the Tables property, which accepts a list of tables.
After upgrading from a version earlier than Transformer 5.2.0, review your SDK for Python code for these origins and replace origin.table with origin.tables.
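To find code that still uses the old property, you could scan your SDK scripts for references to it. The sketch below is a simple text search, not part of the SDK: it assumes the origin stage is held in a variable named origin, as in the text above, and the function name is hypothetical.

```python
import re

# Assumption: the origin stage is held in a variable named "origin".
# Adjust the pattern if your scripts use a different variable name.
_OLD_PROPERTY = re.compile(r"\borigin\.table\b")


def find_old_table_references(source_code):
    """Return (line_number, line) for each line that still uses origin.table."""
    return [
        (number, line.strip())
        for number, line in enumerate(source_code.splitlines(), start=1)
        if _OLD_PROPERTY.search(line)
    ]
```

A flagged line such as origin.table = 'orders' would then become something like origin.tables = ['orders'], assuming the Tables property is set with a Python list in your SDK version.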
Review Use of Generated XML Files
Starting with version 5.2.0, XML files generated by Transformer prebuilt with Scala 2.12 include the following initial declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
If you use Transformer prebuilt with Scala 2.12, then after upgrading from a version earlier than Transformer 5.2.0, review how pipelines use the generated XML files and make necessary changes to account for the new initial declaration.
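The kind of downstream code that might need a change can be illustrated as follows. This is a sketch under assumptions: the sample content and the ROWS and ROW element names are placeholders, and both function names are hypothetical. It contrasts a naive text check, which breaks once the file begins with the declaration, with an XML parser, which accepts it.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

# Hypothetical file content; the ROWS and ROW element names are placeholders.
generated = DECLARATION + "\n<ROWS><ROW><id>1</id></ROW></ROWS>"


def starts_with_root_element(text, root_tag):
    """Naive check that fails once the file begins with an XML declaration."""
    return text.lstrip().startswith(f"<{root_tag}>")


def root_tag_of(text):
    """Parse the document with an XML parser, which handles the declaration."""
    return ET.fromstring(text.encode("utf-8")).tag
```

Consumers that read the files with a real XML parser are usually unaffected, while code that matches on the first characters of the file is the most likely to need an update.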
Install Java JDK 11 for Scala 2.12 Installations
Starting with version 4.1.0, when you use Transformer prebuilt with Scala 2.12, you must install Java JDK 11 on the Transformer machine. In previous releases, though required by Transformer prebuilt with Scala 2.12, a Java JDK 11 installation was not enforced.
If you upgrade from Transformer 4.0.0 to Transformer 4.1.0 or later prebuilt with Scala 2.12, you must have Java JDK 11 installed on the Transformer machine for Transformer to start.
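Before upgrading, you may want to confirm which Java version is on the Transformer machine. The following sketch parses the output of java -version, which the JVM writes to standard error. The function names are hypothetical, and the parsing covers only the common "version" line formats.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` style output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report versions such as "1.8.0_372".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` and return the major version, or None if java is not found."""
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # `java -version` prints to stderr, not stdout.
    return parse_java_major(result.stderr)
```

A result other than 11 suggests that the machine does not yet meet the requirement for Transformer prebuilt with Scala 2.12.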
Access Databricks Job Details
Starting with version 4.1.0, Transformer submits jobs to Databricks differently from previous releases. As a result, you have 60 days from a Databricks job run to view job details.
In previous releases, with each pipeline run, Transformer created a standard Databricks job but used it only once. This job counted toward the Databricks jobs limit.
With version 4.1.0 and later, Transformer submits ephemeral jobs to Databricks. An ephemeral job runs only once and does not count toward the Databricks jobs limit.
However, the details for the ephemeral jobs are only available for 60 days, and are not available through the Databricks jobs menu. When necessary, access Databricks job details while they are available. Pipeline details remain available through Transformer as before.
For information about accessing job details, see Accessing Databricks Job Details.
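If you track run dates yourself, the 60-day window is simple to compute. The sketch below uses hypothetical function names and assumes the window is counted in calendar days from the run date. How Databricks counts the window precisely is not covered here.

```python
from datetime import date, timedelta

# The 60-day availability window described above.
JOB_DETAILS_RETENTION = timedelta(days=60)


def details_available_until(run_date):
    """Return the last date on which job details are expected to be available."""
    return run_date + JOB_DETAILS_RETENTION


def details_still_available(run_date, today):
    """Return True if today falls within the availability window for run_date."""
    return today <= details_available_until(run_date)
```

For example, details_still_available(date(2024, 1, 1), date.today()) indicates whether details for a job run on January 1, 2024 are likely still accessible.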
Update ADLS Stages in HDInsight Pipelines
Starting with version 4.1.0, to use an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for Azure HDInsight cluster, you must configure the stage to use the ADLS cluster-provided libraries stage library.
Update ADLS stages in existing Azure HDInsight cluster pipelines to use the ADLS cluster-provided libraries stage library.
Update Keystore and Truststore Location
Starting with version 4.1.0, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
In previous releases, you could store the keystore and truststore files in the Transformer configuration directory, <installation_dir>/etc, and then define the location of each file using a path relative to that directory. You can continue to store the files in the configuration directory, but StreamSets recommends moving them to the resources directory when you upgrade.
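The file move can be sketched as follows, using the two directories named above. The function name and the file name you pass in are hypothetical. The sketch copies rather than moves the file, so the original stays in place until you have updated the configuration properties and confirmed that Transformer starts.

```python
import shutil
from pathlib import Path


def copy_store_to_resources(installation_dir, store_file_name):
    """Copy a keystore or truststore from etc to the resources directory.

    Returns the path of the copy relative to the resources directory, which is
    the form expected when the location is defined relative to that directory.
    """
    install = Path(installation_dir)
    source = install / "etc" / store_file_name
    resources = install / "externalResources" / "resources"
    resources.mkdir(parents=True, exist_ok=True)
    target = resources / store_file_name
    shutil.copy2(source, target)  # keeps the original file in etc
    return target.relative_to(resources).as_posix()
```

After copying, you would update the keystore and truststore locations in the Transformer configuration properties to the returned relative paths.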