Post Upgrade Tasks

Update Control Hub On-Premises

By default, Control Hub on-premises can work with registered Transformers of any version up to the current version of Control Hub. If you use Control Hub on-premises and you upgrade registered Transformers to a version higher than your current version of Control Hub, you must modify the Transformer version range within your Control Hub installation.

For example, if you use Control Hub on-premises version 3.12.0 and you upgrade registered Transformers to version 6.1.0, you must update the maximum Transformer version that can work with Control Hub. As a best practice, configure the maximum Transformer version to 6.99.999 to ensure that Transformer upgrades to later minor versions, such as 6.1.1 or 6.1.2, will continue to work with Control Hub.

Note: If you register Transformer version 3.16.x or later with Control Hub on-premises version 3.18.x or earlier, then some stages in the Control Hub Pipeline Designer display a Connection property that is not supported. Do not change the property from the default value of None. If you select Choose Value or use a parameter to define the property, Pipeline Designer hides the remaining connection properties and the pipeline fails to run.

1. Log in to Control Hub as the default system administrator - the admin@admin user account.
2. In the Navigation panel, click Administration > Transformers.
3. Click the Component Version Range icon.
4. Enter the maximum Transformer version that can work with Control Hub, such as 6.99.999.
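
The effect of the maximum version setting can be illustrated with a small comparison. The following Python sketch is only an illustration of how dotted version numbers compare, not how Control Hub evaluates the range internally. The function names are hypothetical.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '6.1.0' into a comparable tuple (6, 1, 0)."""
    return tuple(int(part) for part in version.split("."))


def is_within_max(transformer_version: str, max_version: str) -> bool:
    """Return True if transformer_version does not exceed max_version."""
    return parse_version(transformer_version) <= parse_version(max_version)


# A maximum of 3.12.0 excludes Transformer 6.1.0.
print(is_within_max("6.1.0", "3.12.0"))
# A maximum of 6.99.999 admits 6.1.0 and later 6.x versions, but not 7.0.0.
print(is_within_max("6.1.2", "6.99.999"))
print(is_within_max("7.0.0", "6.99.999"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating "6.10.0" as lower than "6.9.0".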

Review Amazon Stages that Assume Roles

Starting with 6.1.0, you can configure a region and endpoint for the Assume Role property in Amazon stages.

In previous versions, the assumed role used the same regional endpoint configured for the Amazon stage. Upgraded stages configured to assume a role use the region and endpoint for the assumed role from the AWS service that the stage connects to. If those are not available, the assumed role uses the global endpoint for assumed roles.

After upgrading to version 6.1.0, review upgraded pipelines to ensure that Amazon stages configured to assume a role are configured with the appropriate region and endpoint.

Upgrade Clusters and Transformer to Spark 3.3 or Later

Starting with 6.0.0, Transformer no longer supports Spark 2.x, Scala 2.11, and Java JDK 8. Spark versions 3.0 - 3.2 are also deprecated with this release.

Because Spark 2.x was supported by Transformer prebuilt with Scala 2.11, which requires Java JDK 8, Transformer prebuilt with Scala 2.11 is no longer available, and Spark 2.x and Java JDK 8 are no longer supported.

To use Transformer 6.0.0, you must use Transformer prebuilt with Scala 2.12, which requires Spark 3.x and Java JDK 11.

Transformer 6.0.0 pipelines can only run on Spark 3.x clusters. However, pipelines built on later Transformer versions can run on clusters using any Spark version that the Transformer version supports. For more information, see Choosing an Installation Package.
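
When you take inventory of existing clusters, the support rules above for Transformer 6.0.0 can be expressed as a short check. This is a minimal Python sketch based only on the statements in this section: Spark 2.x is not supported, Spark 3.0 - 3.2 are deprecated, and other Spark 3.x versions are supported. The function name and status labels are hypothetical.

```python
def spark_support_status(spark_version: str) -> str:
    """Classify a Spark version against the Transformer 6.0.0 rules in this section."""
    parts = spark_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0

    # Transformer 6.0.0 pipelines can only run on Spark 3.x clusters.
    if major != 3:
        return "unsupported"
    # Spark 3.0 - 3.2 are deprecated with this release.
    if minor <= 2:
        return "deprecated"
    return "supported"


for version in ["2.4.8", "3.1.2", "3.3.0"]:
    print(version, spark_support_status(version))
```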

After you migrate your workload to clusters using a supported Spark version, make the following corresponding Transformer updates:

Install a Transformer engine that is compatible with the Spark and Scala versions on the new cluster, that is, Transformer prebuilt with Scala 2.12 or later when available. Then register the engine with Control Hub.
Edit pipelines:
1. Update each pipeline to use the new Transformer as the authoring engine.
2. On the Cluster tab, enter the cluster information for the new cluster.
3. If related stages use Transformer-provided libraries, update the stage library to use an appropriate version.
  If stages use cluster-provided stage libraries, you can skip this step.
Update the execution engine labels in associated jobs so they run on the new Transformer.

Review Pipelines with Removed Functionality

With version 6.0.0, a number of cluster types, supported versions, stages, and other functionality that was previously deprecated have been removed. The functionality has been removed because it is not commonly used, has been replaced, is no longer maintained, or connects to a system with an end-of-service date.

The following tables provide additional details and possible alternatives for the removed functionality:


Removed Support	Details / Alternatives
Spark 2.x, Scala 2.11, and Java JDK 8	Because Spark 2.x was supported by Transformer prebuilt with Scala 2.11, which requires Java JDK 8, Transformer prebuilt with Scala 2.11 is no longer available, and Spark 2.x and Java JDK 8 are no longer supported. When needed, you can continue using Transformer prebuilt with Scala 2.11 with earlier versions of Transformer; however, Transformer prebuilt with Scala 2.11 no longer receives maintenance updates. If feasible, upgrade to Spark 3.x and Transformer prebuilt with Scala 2.12, which requires Java JDK 11. Support for Spark 4.x through Transformer prebuilt with Scala 2.13 is planned for a future release. For information about Scala and Spark support, see Scala, Spark, and Java JDK Requirements. For upgrade information, see Upgrade Clusters and Transformer to Spark 3.3 or Later.


Removed Cluster Support	Details / Alternatives
Amazon EMR: 5.20.0 and later 5.x; 6.1.x - 6.7.x	Amazon states that versions more than two years old are end-of-life. Upgrade to a supported version of Amazon EMR.
Databricks: 5.x - 8.x; 9.1; 10.4	Databricks has specified end-of-life and end-of-support dates for many versions. Upgrade to a supported version of Databricks.
Google Dataproc 1.x - 2.0.x	Google states that versions more than two years old are end-of-life. Upgrade to a supported version of Google Dataproc.
Hadoop YARN - Cloudera: CDE 1.3.x for Spark 2.x; CDH 5.x - 6.x; CDP 7.1.x for Spark 2.x; HDP 3.1.1.1	Cloudera has specified an end-of-life timeline for Cloudera Enterprise and Hortonworks Data Platform products. Upgrade to a supported version of CDP Private Cloud Base or Cloudera Data Engineering.
Hadoop YARN - MapR	Previously supported MapR versions are Spark 2.x clusters, which are no longer supported by Transformer. No additional MapR versions are supported at this time.
Microsoft: Apache Spark for Azure HDInsight 4.0; SQL Server 2019 Big Data Clusters	Microsoft has specified an end-of-life date for Azure HDInsight 4.0 and a retirement date for SQL Server Big Data Clusters. For alternatives, see the Microsoft documentation.


Removed Feature	Details / Alternatives
Transformer user interface	Use the Control Hub user interface to design and run pipelines.
Advanced Error Handling pipeline property	The JDBC Table origin can no longer include the SQL query and results in the Transformer log. You can view those details in your database.
Elasticsearch destination	The Elasticsearch destination was supported only on Spark 2.x clusters, which are no longer supported by Transformer.
All MapR stages	MapR stages were supported on MapR cluster versions that are no longer supported by Transformer.
Stage libraries	The following stage libraries have been removed because they are for Spark cluster versions that are no longer supported or are deprecated with this release: AWS Transformer-provided libraries for Hadoop 2.7.7; Azure SQL for Spark 2.4.x; Azure SQL for Spark 3.0.x; Azure SQL for Spark 3.1.x; Google Cloud Hadoop 2.x libraries; Snowflake Transformer-provided libraries for Spark 2.x; Snowflake Transformer-provided libraries for Spark 3.0; Snowflake Transformer-provided libraries for Spark 3.1; Snowflake Transformer-provided libraries for Spark 3.2.

Review Dockerfiles for Custom Docker Images

Starting with 6.0.0, Transformer Docker images use RedHat UBI9 OpenJDK 11 as the parent image.

In previous releases, Transformer Docker images used eclipse-temurin 11.0.22_7-jdk as the parent image.

If you build custom Transformer images with earlier releases of streamsets/transformer as the parent image, review your Dockerfiles. Make all required updates so they are compatible with RedHat UBI9 OpenJDK 11 before you build a custom image based on Docker images for Transformer 6.0.0: streamsets/transformer:scala-2.12_6.0.0.

Review Delta Lake Stages Using Transformer-Provided Libraries

Starting with 5.7.0, the Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x.

After upgrading from an earlier Transformer version, review pipelines that include Delta Lake stages that use Transformer-provided libraries to ensure that the Spark upgrade does not adversely affect data processing.

For information about Delta Lake versions and Spark compatibility, see the Databricks documentation.

Review Drivers for Azure SQL Destinations

Starting with 5.6.0, the Azure SQL destination requires a Microsoft JDBC driver for SQL Server version 8 or later. In addition, Transformer includes a Microsoft JDBC driver for SQL Server with the destination.

After upgrading from an earlier Transformer version, if an existing Azure SQL pipeline uses an earlier driver version and is configured to perform a bulk copy, the pipeline may fail with a java.lang.NoSuchMethodError error.

To address the error, perform one of the following tasks, as appropriate:

If the older driver is installed on the cluster, remove the driver or upgrade it to version 8 or later. For more information, see the Microsoft documentation.
If the older driver is installed as an external library for the Azure SQL destination, remove the existing version of the external library from Transformer to prevent Transformer from uploading it to the cluster. For more information, see Managing External Libraries.

Review Lookup Processors

Starting with 5.5.0, lookup processors use an internal row_number column to perform lookups and repartition data. This change applies to all lookup processors: Delta Lake Lookup, JDBC Lookup, and Snowflake Lookup.

Because of the new internal column, a lookup processor configured to sort columns and to return the first matching row can no longer include a column named row_number in stage properties.

A lookup processor configured this way now also creates partitions based on the specified lookup fields.

After upgrading from an earlier Transformer version, perform the following tasks for existing lookup processors that have the Column to Sort and Sort Order properties defined, and the Lookup Behavior property set to Return the First Matching Row:

Ensure that the processor does not include a column named row_number in stage properties.
Verify that downstream pipeline processing is not negatively affected by the partitions created for the fields in the Lookup Field properties.
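
The first check above can be sketched as a simple scan of the column names that a lookup processor references. This Python example is only an illustration: the function name is hypothetical, and the case-insensitive match is a conservative assumption rather than documented Transformer behavior.

```python
from typing import Iterable, List

# Name of the internal column that lookup processors use as of version 5.5.0.
RESERVED_COLUMN = "row_number"


def conflicting_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that collide with the internal row_number column."""
    # Assumption: compare case-insensitively to err on the side of caution.
    return [name for name in columns if name.strip().lower() == RESERVED_COLUMN]


print(conflicting_columns(["id", "Row_Number", "name"]))
```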

Review Dockerfiles for Custom Docker Images

Starting with version 5.4.0, the Transformer Docker images use Ubuntu 22.04 LTS (Jammy Jellyfish) as the parent image.

In previous releases, the Transformer Docker images used Alpine Linux as a parent image.

If you build custom Transformer images using a Docker image for Transformer 5.3.x or earlier as the parent image, review your Dockerfiles. Make all required updates so they are compatible with Ubuntu Jammy Jellyfish before you build a custom image based on Docker images for Transformer 5.4.0, streamsets/transformer:scala-2.11_5.4.0 or streamsets/transformer:scala-2.12_5.4.0, or later versions.
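
One common incompatibility when moving from an Alpine Linux parent image to an Ubuntu parent image is the package manager: Alpine images install packages with apk, which Ubuntu images do not provide. The following Python sketch flags Dockerfile lines that call apk so you can rewrite them. It covers only this one kind of incompatibility and recognizes only the common form where the subcommand directly follows apk. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches common apk invocations such as "apk add" or "apk update".
APK_COMMAND = re.compile(r"\bapk\s+(?:add|del|update|upgrade)\b")


def find_apk_lines(dockerfile_text: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs for Dockerfile lines that call apk."""
    hits = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip Dockerfile comments.
        if stripped.startswith("#"):
            continue
        if APK_COMMAND.search(stripped):
            hits.append((number, stripped))
    return hits


sample = """FROM streamsets/transformer:scala-2.12_5.4.0
# apk add curl
RUN apk add --no-cache curl
RUN apt-get update
"""
print(find_apk_lines(sample))
```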

Manage Underscores in Snowflake Connection Information

Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens by default. This can disrupt communication with Snowflake when connection information specified in a Snowflake stage or connection, such as a URL, includes underscores.

After you upgrade to Snowflake JDBC driver 3.13.25 or later, review your Snowflake connection information for underscores.

When needed, you can bypass the default driver behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
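
To help with the review, the following Python sketch reports whether the host portion of a Snowflake JDBC URL contains an underscore. It assumes that the host is the part affected, which is inferred from the name of the allowUnderscoresInHost property. The function name and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def host_has_underscore(jdbc_url: str) -> bool:
    """Return True if the host portion of a JDBC URL contains an underscore."""
    # Drop the leading "jdbc:" so urlsplit sees a regular scheme://host URL.
    url = jdbc_url[len("jdbc:"):] if jdbc_url.startswith("jdbc:") else jdbc_url
    host = urlsplit(url).hostname or ""
    return "_" in host


# Underscore in the host: worth reviewing.
print(host_has_underscore("jdbc:snowflake://my_account.snowflakecomputing.com/?db=sales_db"))
# Underscore only in a query parameter: the host itself is unaffected.
print(host_has_underscore("jdbc:snowflake://my-account.snowflakecomputing.com/?db=sales_db"))
```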

Update SDK Code for Database-Vendor-Specific JDBC Origins

In version 5.2.0, four origins - MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table - replace the Table property that accepted a single table with the Tables property that accepts a list of multiple tables.

After upgrading from a version earlier than Transformer 5.2.0, review your SDK for Python code for these origins and replace origin.table with origin.tables.
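
To locate code that needs the change, you can scan your scripts for the old attribute. The following Python sketch finds lines that use a .table attribute but not .tables. It matches on text only, so it also flags .table on objects that are not origins. Review each hit before editing. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Matches ".table" as a whole attribute name, but not ".tables" or ".table_name".
SINGLE_TABLE_ATTR = re.compile(r"\.table\b")


def find_single_table_uses(source: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs where a .table attribute is used."""
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if SINGLE_TABLE_ATTR.search(line)
    ]


script = "origin.table = 'orders'\norigin.tables = ['orders']\n"
print(find_single_table_uses(script))
```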

Review Use of Generated XML Files

In version 5.2.0 of Transformer prebuilt with Scala 2.12, the XML files that destinations write include the following initial XML declaration:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

If you use Transformer prebuilt with Scala 2.12, then after upgrading from a version earlier than Transformer 5.2.0, review how pipelines use the generated XML files and make necessary changes to account for the new initial declaration.
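
Consumers that parse the files with an XML parser typically handle the declaration without changes, while consumers that compare or concatenate raw text may need to strip it. The following Python sketch shows both cases with the standard library. The element names in the sample are hypothetical, as are the function names.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the start of the text, plus trailing whitespace.
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def strip_declaration(text: str) -> str:
    """Remove a leading XML declaration, if present, for raw-text consumers."""
    return XML_DECLARATION.sub("", text, count=1)


def root_tag(text: str) -> str:
    """Parse the XML and return the root element name."""
    # Parsing bytes lets the parser honor the declared encoding.
    return ET.fromstring(text.encode("utf-8")).tag


new_style = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n<ROW><id>1</id></ROW>'
old_style = "<ROW><id>1</id></ROW>"

# A parser sees the same root element either way.
print(root_tag(new_style) == root_tag(old_style))
# Raw-text consumers can normalize the new output to the old form.
print(strip_declaration(new_style) == old_style)
```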

Install Java JDK 11 for Scala 2.12 Installations

Starting with version 4.1.0, when you use Transformer prebuilt with Scala 2.12, you must install Java JDK 11 on the Transformer machine. In previous releases, Transformer prebuilt with Scala 2.12 required Java JDK 11 but did not enforce the installation.

If you upgrade from Transformer 4.0.0 to Transformer 4.1.0 or later prebuilt with Scala 2.12, you must have Java JDK 11 installed on the Transformer machine for Transformer to start.

Access Databricks Job Details

Starting with version 4.1.0, Transformer submits jobs to Databricks differently from previous releases. As a result, you have 60 days from a Databricks job run to view job details.

In previous releases, with each pipeline run, Transformer created a standard Databricks job, but used it only once. This job counted toward the Databricks jobs limit.

With version 4.1.0 and later, Transformer submits ephemeral jobs to Databricks. An ephemeral job runs only once, and does not count toward the Databricks jobs limit.

However, the details for the ephemeral jobs are only available for 60 days, and are not available through the Databricks jobs menu. When necessary, access Databricks job details while they are available. Pipeline details remain available through Transformer as before.
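
If you need to archive job details before they expire, it can help to compute the cutoff date for each run. This is a minimal Python sketch of that arithmetic, assuming the 60 days are counted in calendar days from the run date. The function name is hypothetical.

```python
from datetime import date, timedelta

# Retention period for ephemeral job details, as stated above.
RETENTION_DAYS = 60


def details_available_until(run_date: date) -> date:
    """Return the date 60 days after the job run."""
    return run_date + timedelta(days=RETENTION_DAYS)


print(details_available_until(date(2024, 1, 1)))
```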

For information about accessing job details, see Accessing Databricks Job Details.

Update ADLS Stages in HDInsight Pipelines

Starting with version 4.1.0, to use an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for Azure HDInsight cluster, you must configure the stage to use the ADLS cluster-provided libraries stage library.

Update ADLS stages in existing Azure HDInsight cluster pipelines to use the ADLS cluster-provided libraries stage library.

Update Keystore and Truststore Location

Starting with version 4.1.0, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, $TRANSFORMER_RESOURCES. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.

In previous releases, you could store the keystore and truststore files in the Transformer configuration directory, $TRANSFORMER_CONF, and then define the location of the files using a path relative to that directory. You can continue to store the files in the configuration directory, but moving them to the resources directory when you upgrade is recommended.

Update Kubernetes Pipelines

Starting with version 4.0.0, Kubernetes clusters are no longer supported.

If you have pipelines that run on Kubernetes clusters, update them to run on a supported cluster type.

Update Drivers on Older Hadoop Clusters

When you run pipelines on older distributions of Hadoop clusters, the cluster can have an older JDBC driver on the classpath that takes precedence over the JDBC driver required for the pipeline. This can be a problem for PostgreSQL and SQL Server JDBC drivers.

When a pipeline encounters this issue, it generates a SQLFeatureNotSupportedException error, such as:

java.sql.SQLFeatureNotSupportedException: This operation is not supported.

To avoid this issue, update the PostgreSQL and SQL Server JDBC drivers on the cluster to the latest available versions.

Enable the Spark External Shuffle Service

Starting with version 3.14.0, Transformer requires Spark clusters to have the Spark external shuffle service enabled.

Most Spark clusters have the external shuffle service enabled by default. However, Hortonworks clusters do not.

After you upgrade to Transformer version 3.14.0 or later, ensure that all Hortonworks clusters have the Spark external shuffle service enabled. Other clusters may also require enabling the service.

For more information, see Spark Shuffle Service Requirement.
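
In Spark, the external shuffle service is controlled by the spark.shuffle.service.enabled configuration property, which defaults to false. The following Python sketch checks a dictionary of Spark properties for that setting. How you collect the properties from your cluster is outside the scope of the sketch, and the function name is hypothetical.

```python
from typing import Mapping


def shuffle_service_enabled(spark_conf: Mapping[str, str]) -> bool:
    """Return True if the Spark properties enable the external shuffle service."""
    # Spark treats the property as false when it is not set.
    value = str(spark_conf.get("spark.shuffle.service.enabled", "false"))
    return value.strip().lower() == "true"


print(shuffle_service_enabled({"spark.shuffle.service.enabled": "true"}))
print(shuffle_service_enabled({}))
```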

Update Spark SQL Expression Pipelines

Starting with version 3.14.0, the Spark NullType data type is no longer supported.

If upgraded pipelines include Spark SQL Expression processors that use or result in a null value, update those processors so they do not use the NullType data type:

If the expression might result in a null value, you can use the new Cast To property to select the data type to use.
If the expression logic uses null values, add a cast call to ensure that the null values are handled properly.
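
For the second case, a cast call in Spark SQL takes the form CAST(expression AS type). The following Python sketch only builds such an expression as text to show the shape of the change. The function name is hypothetical, and the data type to use depends on your pipeline.

```python
def with_cast(expression: str, data_type: str) -> str:
    """Wrap a Spark SQL expression in a CAST call to give it an explicit type."""
    return f"CAST({expression} AS {data_type})"


# A bare NULL has no concrete type; the cast assigns one.
print(with_cast("NULL", "STRING"))
```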