Release Notes

5.7.x Release Notes

The Transformer 5.7.0 release occurred on February 26, 2024.

New Features and Enhancements

Clusters
  • Databricks 13.3 support - You can now run pipelines on Databricks 13.3 clusters.
  • External libraries on Databricks, Dataproc, EMR, and EMR Serverless clusters - Transformer fully manages staging directories for these clusters. When you add, remove, or update external libraries as external resources on Transformer, Transformer now automatically updates the cluster staging directories when you run a related pipeline.

    Previously, Transformer only updated these cluster staging directories when you added external libraries or updated external libraries with the same base file name. Now, Transformer compares the external libraries installed on Transformer against cluster staging directories and updates these directories to match, as needed.

  • EMR clusters
    • AWS Service Catalog support - You can configure an EMR pipeline to run on a cluster provisioned by AWS Service Catalog.

    • Define Cluster Start Option property - Use this property to specify whether the pipeline runs on an existing cluster, a cluster provisioned by Transformer, or a cluster provisioned by AWS Service Catalog.

      In previous releases, you used the Provision a New Cluster property to specify whether the pipeline ran on an existing cluster or a cluster provisioned by Transformer. This update does not require changes to existing pipelines.
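
The staging directory update described above for external libraries can be pictured as a comparison between two file listings. The following Python sketch is illustrative only: the release notes do not state how Transformer compares files, so the function name, the name-to-checksum dictionaries, and the checksum comparison are all assumptions.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    upload: List[str]  # files to copy to the staging directory
    delete: List[str]  # files to remove from the staging directory


def plan_staging_sync(installed: Dict[str, str], staged: Dict[str, str]) -> SyncPlan:
    """Compare installed external libraries with a staging directory listing.

    Both arguments map a file name to a content checksum (hypothetical inputs).
    """
    upload = sorted(
        name
        for name, checksum in installed.items()
        if staged.get(name) != checksum  # missing from staging, or content changed
    )
    # Anything staged that is no longer installed on Transformer gets removed.
    delete = sorted(name for name in staged if name not in installed)
    return SyncPlan(upload=upload, delete=delete)


# Hypothetical file names: a driver was replaced by a new version with a
# different base file name, and udf.jar was rebuilt under the same name.
installed = {"driver-2.0.jar": "b2", "udf.jar": "c1"}
staged = {"driver-1.0.jar": "a1", "udf.jar": "c0"}
print(plan_staging_sync(installed, staged))
```

In this sketch the renamed driver is both uploaded under its new name and deleted under its old one, which is the case that earlier releases did not handle.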

Connections
Amazon EMR Cluster connections include the same changes as EMR clusters:
  • You can configure a connection to run a job on a cluster provisioned by AWS Service Catalog.
  • The new Define Cluster Start Option property allows you to specify whether the pipeline runs on an existing cluster, a cluster provisioned by Transformer, or a cluster provisioned by AWS Service Catalog.

    Previously, you used the Provision a New Cluster property to specify whether the pipeline ran on an existing cluster or a cluster provisioned by Transformer. This update does not require changes to existing pipelines.

Transformer driver callback URL

Transformer includes a new driver callback URL property, transformer.driver.callback.url. This property defines the cluster callback URL for Spark to communicate with Transformer. Transformer uses the specified callback URL for all pipelines, unless overridden by the existing Cluster Callback URL property defined in individual pipelines.

In previous releases, the specified Transformer base URL, transformer.base.http.url, which defines how Control Hub communicates with Transformer, also acted as the cluster callback URL.

This enhancement is not a behavior change. However, if you previously configured the Cluster Callback URL property in pipelines as a workaround to avoid using the Transformer base URL as the cluster callback URL, you can now define the new transformer.driver.callback.url property instead.

Note that the Cluster Callback URL pipeline property, when defined, takes precedence over all other possible URLs. For more information, see Understanding the Spark Cluster Callback URL.
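
The order of precedence described above can be summarized in a short Python sketch. This is a reading of the release note, not Transformer code: the function name and the example URLs are hypothetical, and the final fallback to the base URL is an assumption based on the statement that the enhancement is not a behavior change.

```python
from typing import Optional


def resolve_callback_url(
    pipeline_callback_url: Optional[str],
    driver_callback_url: Optional[str],
    base_http_url: str,
) -> str:
    """Pick the URL that Spark uses to call back to Transformer."""
    # 1. The Cluster Callback URL pipeline property wins whenever it is defined.
    if pipeline_callback_url:
        return pipeline_callback_url
    # 2. Otherwise use the engine-wide transformer.driver.callback.url property.
    if driver_callback_url:
        return driver_callback_url
    # 3. Assumed fallback: transformer.base.http.url, as in previous releases.
    return base_http_url


# Hypothetical URLs for illustration only.
print(resolve_callback_url(None, "https://callback.example.com:19630",
                           "https://transformer.example.com:19630"))
```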

Stages and libraries
  • Library support:
    • Delta Lake 2.4.0 support - Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x. This change can have upgrade impact.
    • Amazon support for Hadoop 3.3.4 - Transformer provides AWS libraries for Hadoop 3.3.4 for use with Amazon S3 stages.
  • Deprecated stages - The ADLS Gen1 origin and destination have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these stages. For suggested alternatives, see Deprecated Functionality.
Additional updates
  • Docker image upgrades - The Docker images for Transformer 5.7.0 include upgraded Spark versions, as follows:
    • streamsets/transformer:scala-2.11_5.7.0 now uses Spark 2.4.8, upgraded from Spark 2.4.5.

    • streamsets/transformer:scala-2.12_5.7.0 now uses Spark 3.4.1, upgraded from Spark 3.0.1.

  • Ludicrous input metrics - When a pipeline runs in Ludicrous processing mode, Transformer provides both input and output statistics.

    In previous releases, Transformer provided only output statistics by default, but you could enable the generation of input statistics as needed. With this update, the Collect Input Metrics pipeline property has been removed since input metrics are now always generated in Ludicrous mode.

Upgrade Impact

Delta Lake stages that use Transformer-provided libraries
With this release, the Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x. Review pipelines that include Delta Lake stages that use Transformer-provided libraries to ensure that the Spark upgrade does not adversely affect data processing.
For information about Delta Lake versions and Spark compatibility, see the Databricks documentation.

5.7.0 Fixed Issues

  • When you restart Transformer, Transformer does not correctly report the status of pipelines that have been running on MapR and CDH clusters, which can lead to the incorrect belief that the pipelines have stopped. With this fix, Transformer correctly reconnects to MapR and CDH clusters and provides accurate pipeline status reports.
  • Transformer does not provide access to Spark driver logs for pipelines that run on EMR Serverless applications, existing EMR clusters, or provisioned EMR clusters that store the driver logs outside of Amazon S3.

5.7.x Known Issues

  • If you restart Transformer, then force stop a pipeline that runs on a Spark Standalone cluster or a MapR cluster with security enabled, Transformer can indicate that the pipeline has been stopped even though the pipeline continues to run.

    Workaround: Use Spark or YARN monitoring tools to track and manage those pipelines.

  • When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
    preferred_min_stream_count must be less than or equal to max_stream_count

    Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.

  • Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
  • In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized

    Workaround: Create the table before running the pipeline.

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.

5.6.x Release Notes

The Transformer 5.6.0 release occurred on October 26, 2023.

New Features and Enhancements

Assume role external ID support
Amazon S3 stages, EMR pipelines, and EMR Serverless pipelines allow you to specify an external ID to use when assuming another role.
Azure SQL destination
  • The destination is available for use with Spark 3.0.x and later, except for version 3.2.x.
  • The destination includes the following new properties for performing a bulk copy:
    • Reliability Level
    • Isolation Level
    • Schema Check
  • The Auto-Create Table property, which was available when performing a bulk copy, has been removed. It was unnecessary because the destination always creates a table as needed when performing a bulk copy.
  • Transformer now includes a Microsoft JDBC driver for SQL Server with the Azure SQL destination. Starting with this release, the destination requires version 8 or later of the driver. This change can have upgrade impact.
Connections
EMR and EMR Serverless connections also allow you to specify an external ID to use when assuming another role.

Upgrade Impact

Update older Microsoft drivers for Azure SQL pipelines
Starting with 5.6.0, the Azure SQL destination requires a Microsoft JDBC driver for SQL Server version 8 or later. Transformer also includes a Microsoft JDBC driver for SQL Server with the destination, starting with this release.
If an existing Azure SQL pipeline uses an earlier driver version and is configured to perform a bulk copy, the pipeline may fail with a java.lang.NoSuchMethodError error.
To address the error, perform one of the following tasks, as appropriate:
  • If the older driver is installed on the cluster, remove the driver or upgrade it to version 8 or later. For more information, see the Microsoft documentation.

  • If the older driver is installed as an external library for the Azure SQL destination, remove the existing version of the external library from Transformer to prevent Transformer from uploading it to the cluster. For more information, see Managing External Libraries.

5.6.0 Fixed Issues

  • Runtime parameters and runtime properties are not correctly evaluated when used to specify the bucket and path in Amazon S3 destinations.

  • Long running pipelines can cause memory issues with Transformer.

  • Sometimes, Force Stop does not kill pipelines that run on Hadoop YARN clusters.

  • The Partition Columns property in file-based origins includes an empty partition column option by default.

5.6.x Known Issues

  • When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
    preferred_min_stream_count must be less than or equal to max_stream_count

    Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.

  • Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
  • In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized

    Workaround: Create the table before running the pipeline.

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.

5.5.x Release Notes

The Transformer 5.5.0 release occurred on July 11, 2023.

New Features and Enhancements

Clusters
Stages
Pipelines
  • Pipeline retry properties - Use the following properties to specify how Transformer tries to start pipelines that fail to start:
    • Retry Pipeline on Error - Enables trying to start a pipeline again after it fails to start.
    • Retry Attempts - Number of times to try to start a pipeline that fails to start. The default is 3.
  • Field selection - When a property includes a magnifying glass icon, you can use preview data to select fields instead of typing field names:
    • After previewing data, stages provide autocomplete suggestions for field names in properties that accept multiple fields.
    • After selecting fields, you can drag the fields to change the order in which they appear.
    • In some stages, after you select one or more fields to use, you can then enter an expression that includes the field name or enter additional field names.
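
The pipeline retry properties listed above amount to a bounded retry loop around the start operation. The Python sketch below illustrates that idea only. The function and parameter names are hypothetical, and it assumes that Retry Attempts counts additional tries after the first failed start, which the release notes do not spell out.

```python
from typing import Callable


def start_with_retries(
    start_pipeline: Callable[[], bool],
    retry_on_error: bool = True,  # stands in for Retry Pipeline on Error
    retry_attempts: int = 3,      # stands in for Retry Attempts (default 3)
) -> int:
    """Try to start a pipeline; return how many tries it took.

    Raises RuntimeError when every permitted try fails.
    """
    # Assumption: one initial try plus retry_attempts further tries.
    max_tries = 1 + (retry_attempts if retry_on_error else 0)
    for attempt in range(1, max_tries + 1):
        if start_pipeline():
            return attempt
    raise RuntimeError(f"pipeline failed to start after {max_tries} tries")


# Simulated start that fails twice, then succeeds.
outcomes = iter([False, False, True])
print(start_with_retries(lambda: next(outcomes)))  # 3
```
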
Deprecation and testing update

Due to end-of-life declarations, the following cluster versions are deprecated with this release and are no longer tested with Transformer:

  • Azure HDInsight 4.0
  • CDH 6.x
  • Databricks 5.x - 8.x
  • HDP
  • SQL Server 2019 Big Data Clusters

Upgrade Impact

Review lookup processor pipelines that sort columns and return the first matching row
In previous releases, lookups on large data sets could fail. This issue is fixed with this release, but the fix requires lookup processors to use an internal row_number column.
Due to the new internal column, with this release, a lookup processor configured to sort columns and to return the first matching row can no longer include a row_number column in stage properties.
The same lookup processor also now creates partitions based on the specified lookup fields.
After upgrading to 5.5.0, perform the following tasks for existing lookup processors that have the Column to Sort and Sort Order properties defined, and the Lookup Behavior property set to Return the First Matching Row:
  • Ensure that the processor does not include a column named row_number in stage properties.
  • Verify that downstream pipeline processing is not negatively affected by the partitions created for the fields in the Lookup Field properties.
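
To see why a user column named row_number conflicts with the fix, it can help to picture how a first-matching-row lookup is typically computed: partition by the lookup field, sort each partition, number the rows, and keep row 1. The pure-Python sketch below imitates that pattern. It is an assumption about the general technique, not Transformer's implementation, and the function and field names are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def first_matching_rows(
    rows: List[Row], lookup_field: str, sort_column: str, descending: bool = False
) -> List[Row]:
    """Return the first row per lookup key after sorting each partition."""
    # Partition the rows by the lookup field.
    partitions: Dict[Any, List[Row]] = {}
    for row in rows:
        partitions.setdefault(row[lookup_field], []).append(row)

    result: List[Row] = []
    for key in sorted(partitions):
        ordered = sorted(
            partitions[key], key=lambda r: r[sort_column], reverse=descending
        )
        # Number the rows. A user column called row_number would be
        # overwritten here, which is the kind of collision to avoid.
        numbered = [dict(r, row_number=i) for i, r in enumerate(ordered, start=1)]
        first = next(r for r in numbered if r["row_number"] == 1)
        first.pop("row_number")  # the helper column is internal only
        result.append(first)
    return result


rows = [
    {"id": 1, "ts": 5, "v": "a"},
    {"id": 1, "ts": 2, "v": "b"},
    {"id": 2, "ts": 9, "v": "c"},
]
print(first_matching_rows(rows, lookup_field="id", sort_column="ts"))
```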

5.5.0 Fixed Issues

  • Transformer logs include sensitive information in Debug mode.
  • Lookups on large data sets can fail.
  • The Surrogate Key Generator processor loses track of the last-saved offset after processing an empty batch.
  • Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.

5.5.x Known Issues

  • Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
  • In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized

    Workaround: Create the table before running the pipeline.

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.

5.4.x Release Notes

The Transformer 5.4.0 release occurred on April 27, 2023.

New Features and Enhancements

Clusters
  • Amazon EMR - You can specify a cluster using the cluster name and tags, instead of using the cluster ID.
  • Amazon EMR Serverless - You can specify an application using the application name and tags, instead of using the application ID.
  • Amazon EMR and EMR Serverless - You can configure the following Transformer configuration properties to define how Transformer retries monitoring EMR and EMR Serverless pipelines:
    • transformer.emr.monitoring.max.retry

    • transformer.emr.monitoring.retry.base.backoff

    • transformer.emr.monitoring.retry.max.backoff

  • Databricks enhancements:
    • You can configure the transformer.databricks.external.resources.cache Transformer configuration property to cache runtime resource files for reuse.
    • You can configure the following Transformer configuration properties to define how Transformer retries starting Databricks pipelines:
      • transformer.databricks.run.max.retries
      • transformer.databricks.run.retry.interval
    • When needed, you can configure Transformer to lock the Databricks workspace when it starts a pipeline. This can prevent timeout errors when multiple Transformer engines try to start pipelines that require uploading a large volume of runtime resource files to staging areas.
  • Amazon EMR, Amazon EMR Serverless, Databricks, and Dataproc clusters - When you update an external library used by these clusters for Transformer, Transformer can update the external libraries in the cluster staging directories, so you do not need to manually remove older versions.
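
The names of the three EMR monitoring properties above suggest a capped backoff between retries. The Python sketch below shows one common way such settings interact. It is a guess at the pattern only: the release notes do not say that the backoff is exponential, nor which units the properties use, and the function name and sample values are hypothetical.

```python
from typing import List


def backoff_schedule(max_retry: int, base_backoff: float, max_backoff: float) -> List[float]:
    """Return the wait before each retry, doubling from base up to a cap.

    max_retry    ~ transformer.emr.monitoring.max.retry
    base_backoff ~ transformer.emr.monitoring.retry.base.backoff
    max_backoff  ~ transformer.emr.monitoring.retry.max.backoff
    The doubling rule is an assumption, not documented behavior.
    """
    return [min(base_backoff * (2 ** attempt), max_backoff) for attempt in range(max_retry)]


# Sample values in arbitrary time units.
print(backoff_schedule(max_retry=5, base_backoff=1.0, max_backoff=10.0))
# [1.0, 2.0, 4.0, 8.0, 10.0]
```
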
Connections
  • Amazon EMR - You can specify a cluster using the cluster name and tags, instead of using the cluster ID. Available with authoring Data Collector 5.5.0 or later.
  • Amazon EMR Serverless - You can specify an application using the application name and tags, instead of using the application ID. Available with authoring Data Collector 5.5.0 or later.
Additional enhancements
  • Transformer Docker image - The Docker images for Transformer 5.4.0, streamsets/transformer:scala-2.11_5.4.0 and streamsets/transformer:scala-2.12_5.4.0, use Ubuntu 22.04 LTS (Jammy Jellyfish) as a parent image. This change can have upgrade impact.

Upgrade Impact

Review Dockerfiles for custom Docker images

In previous releases, Transformer Docker images used Alpine Linux as a parent image. Due to limitations in Alpine Linux, with this release Transformer Docker images use Ubuntu 22.04 LTS (Jammy Jellyfish) as the parent image.

If you build custom Transformer images with earlier releases of streamsets/transformer as the parent image, review your Dockerfiles. Make all required updates so they are compatible with Ubuntu Jammy Jellyfish before you build a custom image based on streamsets/transformer:scala-2.11_5.4.0 or streamsets/transformer:scala-2.12_5.4.0.

5.4.0 Fixed Issues

  • When returning an empty dataframe, some processors, such as the Filter processor, do not include the schema as expected.
  • The Surrogate Key Generator processor restarts key generation using the configured initial value after receiving an empty dataframe.
  • When a provisioned pipeline completes successfully on Databricks, Databricks can incorrectly display a Cancelled or Failed status instead of a Succeeded status.

5.4.x Known Issues

  • Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
  • In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
  • Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.

    Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.

  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized

    Workaround: Create the table before running the pipeline.

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.

5.3.x Release Notes

The Transformer 5.3.0 release occurred on February 17, 2023.

New Features and Enhancements

Clusters
  • Amazon EMR Serverless - Transformer can run pipelines on an Amazon EMR Serverless application.
  • Databricks clusters
    • You can include the Google Big Query origin and destination and the Google Cloud Storage origin and destination in pipelines that you run on a Databricks cluster.
    • You can run pipelines on a Databricks 11.3 cluster.
  • Hadoop YARN clusters - You can run pipelines on a MapR 7.0 distribution of Hadoop YARN. The distribution requires Ezmeral Ecosystem Pack (EEP) 8.1, which includes Spark 3.2.0. MapR is now called HPE Ezmeral Data Fabric.
Stages
  • Data Formats property - The Whole Directory and MapR FS origins have an Additional Data Format Configuration property on the Data Formats tab. You can use this property to enter other data format parameters.
  • Google stages - You can include the Google Big Query origin and destination and the Google Cloud Storage origin and destination in pipelines that you run on a Databricks cluster.
  • Slowly Changing Dimension processor - Improved logic produces consistent and expected results in the following existing features:
    • Null handling - When enabled, the processor replaces null values in change records.
    • Records without changes - The processor discards any change record that does not contain a change.
    • Tracking fields - Change records do not require tracking fields.
    • Timestamp basis field - The processor supports three categories for the timestamp basis field:
      • Same name as tracking field
      • Data field in master dimension
      • Extra field in change record
    • Non-Type 2 updates - In a Type 2 dimension, a change record can trigger a non-Type 2 update.
    • New records - For change records without a matching master record, the processor inserts a new record.
Connections
Additional enhancements
  • Thycotic Secret Server credential store - You can use Thycotic Secret Server as a credential store for Transformer.
  • Support bundles - You can generate a support bundle when Transformer uses the default WebSocket tunneling communication method.

    Earlier Transformer versions require using direct engine REST APIs to generate support bundles.

Documentation terminology
  • In the documentation that discusses slowly changing file dimensions, the terms “grouped file dimension” and “ungrouped file dimension” replace the terms “partitioned file dimension” and “unpartitioned file dimension.”

Upgrade Impact

Review account types
With this release, Transformer no longer supports StreamSets accounts. If you were using a StreamSets account with Transformer, switch to a different account type.
Manage underscores in Snowflake connection information
Starting with the Snowflake JDBC driver 3.13.25 release in November 2022, the Snowflake JDBC driver converts underscores to hyphens, by default.

This can adversely affect communicating with Snowflake when Snowflake connection information specified in a Snowflake stage or connection, such as a URL, includes underscores. When needed, you can bypass this behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
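
JDBC driver properties are often passed as query parameters on the connection URL. As a minimal sketch under that assumption, the Python helper below appends allowUnderscoresInHost=true to a Snowflake JDBC URL. The helper name and the account URL are hypothetical, and a given stage or connection may instead expect the property in a separate driver properties field.

```python
def allow_underscores(jdbc_url: str) -> str:
    """Append allowUnderscoresInHost=true unless the URL already sets it."""
    if "allowUnderscoresInHost=" in jdbc_url:
        return jdbc_url
    # Use "&" when the URL already carries query parameters, "?" otherwise.
    separator = "&" if "?" in jdbc_url else "?"
    return f"{jdbc_url}{separator}allowUnderscoresInHost=true"


# Hypothetical account host that contains an underscore.
url = "jdbc:snowflake://my_org-my_account.snowflakecomputing.com/?warehouse=WH"
print(allow_underscores(url))
```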

5.3.0 Fixed Issues

  • In a PySpark processor, you cannot create views from inputs and then run queries on those views.
  • If a stage contains a property that can use batch functions, such as the Directory Path property of the File destination, and that stage references a pipeline parameter, then evaluation of the pipeline parameter fails.
  • The Delta Lake destination generates an error when using multiple merge keys in an upsert operation.

5.3.x Known Issues

  • Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
  • In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
  • Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.

    Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.

  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
    

    Workaround: Create the table before running the pipeline.

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
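For the Conscrypt workaround on new clusters described above, the following is a minimal sketch that assumes you create the Dataproc cluster yourself with the gcloud CLI. The cluster name and region are placeholders, and the helper function is hypothetical. Only the dataproc:dataproc.conscrypt.provider.enable property comes from the workaround itself.

```python
import shlex
from typing import List

# Property named in the workaround; setting it to false disables Conscrypt.
CONSCRYPT_PROPERTY = "dataproc:dataproc.conscrypt.provider.enable"


def build_create_cluster_command(cluster_name: str, region: str) -> List[str]:
    """Return gcloud arguments that create a cluster with Conscrypt disabled."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--properties={CONSCRYPT_PROPERTY}=false",
    ]


if __name__ == "__main__":
    # "example-cluster" and "us-central1" are placeholder values.
    print(shlex.join(build_create_cluster_command("example-cluster", "us-central1")))
```

The function only assembles the command. Review the printed command and add any other options your cluster needs before you run it.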

5.2.x Release Notes

The Transformer 5.2.0 release occurred on October 27, 2022.

New Features and Enhancements

New stage
XML Parser processor - You can use the new XML Parser processor to parse an XML document in a string field and pass the parsed data to a map field.
Database table origins
The MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table origins include the following enhancements:
  • Support for multiple tables - These origins can read from multiple tables. The Table property in previous releases has become the Tables property. With one of these origins configured to read from multiple tables, a pipeline processes multiple batches, in both batch and streaming mode.
  • Batch headers - Pipelines with these origins generate batch headers for each batch and the origin writes the jdbc.table attribute in the header. The attribute stores the name of the table that the origin reads for the batch.
Other stage enhancements
  • Data Formats property - The ADLS Gen1, ADLS Gen2, Amazon S3, File, and Google Cloud Storage origins have an Additional Data Format Configuration property on the Data Formats tab. You can use this property to enter other data format parameters.
  • File-based destinations - The ADLS Gen1, ADLS Gen2, Amazon S3, File, and Google Cloud Storage destinations generate a validation error when writing in the Text data format if the Partition by Fields property is enabled.
  • Security option for Kafka stages - Kafka stages support the Custom Authentication (Security Protocol=CUSTOM) security option. Use the option to specify custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
  • Snowflake destination - When merging data, the destination supports multiple join keys.
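To illustrate what merging on multiple join keys means, the following sketch builds a merge condition that requires every key to match. It is not the statement that the Snowflake destination generates. The function name, table aliases, and column names are hypothetical.

```python
from typing import Sequence


def build_merge_condition(join_keys: Sequence[str],
                          target: str = "t",
                          source: str = "s") -> str:
    """Combine one equality test per join key with AND."""
    if not join_keys:
        raise ValueError("at least one join key is required")
    return " AND ".join(f"{target}.{key} = {source}.{key}" for key in join_keys)


if __name__ == "__main__":
    # ORDER_ID and REGION are example column names.
    print(build_merge_condition(["ORDER_ID", "REGION"]))
```

With two keys, a row in the target matches an incoming row only when both columns are equal.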
Clusters
  • Amazon EMR clusters - You can set the Amazon EMR runtime role with the Execution Role property.
  • Google Dataproc clusters
    • In pipelines running on Dataproc 2.0.40 clusters, you can include the Amazon Redshift destination and the Avro data format in Amazon S3 stages and Google Cloud Storage stages.
    • You can access details about the Dataproc job run for a Transformer pipeline in one of the following ways:
      • After the corresponding Control Hub job completes, view the job run summary from the job History tab, which displays the Dataproc Job URL. Use the URL to access the Dataproc job in the Google Cloud Console.
      • Log in to the Google Cloud Console and view the list of jobs run on the Dataproc cluster. Filter the jobs by the streamsets-transformer-pipeline-id or streamsets-transformer-pipeline-name label. Both labels are applied to all Dataproc jobs run for Transformer pipelines.
Additional enhancements
  • Expression language - Batch functions retrieve the value of an attribute in a batch header. You can use the functions in specific properties of destination stages.
  • Credential stores
    • Google Secret Manager - You can use Google Secret Manager as a credential store for Transformer.
    • Property for migration - You can use a new credentialStores.usePortableGroups credential stores property to migrate pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.

Upgrade Impact

Review use of generated XML files
In Transformer 5.2.0 prebuilt with Scala 2.12, the XML files that destinations write include the following initial XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

If you use Transformer prebuilt with Scala 2.12, then after you upgrade to 5.2.0, review how pipelines use the generated files and make necessary changes to account for the new initial declaration.

Update SDK code for database-vendor-specific JDBC origins
In Transformer 5.2.0, four origins – MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table – replace the Table property that accepted a single table with the Tables property that accepts a list of multiple tables.

After you upgrade to 5.2.0, review your SDK for Python code for these origins and replace origin.table with origin.tables.
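One way to find code that needs this update is to scan your scripts for the old attribute. The following sketch is a heuristic only: it flags any line that accesses an attribute named table, whatever the variable is called. The function name is hypothetical, and the assigned values in the sample text are assumptions about how your scripts might look.

```python
import re
from typing import List, Tuple

# Matches ".table" as a whole attribute name, not ".tables" or ".table_name".
SINGLE_TABLE_ATTR = re.compile(r"\.table\b")


def find_table_references(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each line that uses .table."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if SINGLE_TABLE_ATTR.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Sample script text; only the first line uses the old attribute.
    sample = "origin.table = 'orders'\norigin.tables = ['orders', 'customers']\n"
    for number, line in find_table_references(sample):
        print(f"line {number}: {line}")
```

Because the pattern is not limited to origin objects, review each flagged line before changing it.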

5.2.0 Fixed Issues

  • File-based stages in Transformer do not work correctly with Avro files in Spark 2.4.7 clusters.
  • Some Databricks cluster configuration parameters for provisioning are ignored.
  • Pipelines running on a Dataproc cluster fail, unable to access or create a directory specified in a resource file.

5.2.x Known Issues

  • Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.

    Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.

  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
    

    Workaround: Create the table before running the pipeline.

  • In a PySpark processor, you cannot create views from inputs and then run queries on those views.
    For example, you cannot use the following code:
    inputs[0].createOrReplaceTempView('test999')
    output = spark.sql('select * from test999')

    Workaround: Recreate the DataFrame in the Python Spark session.

    For example:
    df = spark.createDataFrame(inputs[0].rdd)
    df.createOrReplaceTempView("test999")
    output = spark.sql('select * from test999')
  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.

5.1.x Release Notes

The Transformer 5.1.0 release occurred on July 26, 2022.

New Features and Enhancements

Cluster support
  • Databricks 10.4 - You can run pipelines on Databricks 10.4 clusters.
  • Dataproc 2.0.40 - You can run pipelines on Dataproc 2.0.40 clusters. However, with this version pipelines cannot include the Amazon Redshift destination or the Avro data format in Amazon S3 stages or Google Cloud Storage stages.
Database table origins
This release includes the following enhancements for the MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table origins:
  • Maximum number of partitions - The Number of Partitions property in previous releases has become the Maximum Number of Partitions property. When the pipeline runs, the origin creates up to the specified number of partitions. This allows the pipeline to run even if the origin cannot create the specified number of partitions.

    In previous versions, if an origin could not create the specified number of partitions, the pipeline failed.

  • Automatic partition selection - When you configure an origin to skip offset tracking, the origin attempts to select a logical partition column for the read. For more information, see “Partition Column Selection” in the documentation for the origin.

    In previous versions, when an offset column could not be found to use for partitioning, the pipeline failed.

Runtime parameters
You can use runtime parameters to represent a stage or pipeline property that displays as a list of configurations. For example, you can use a runtime parameter to define the Additional JDBC Configuration Properties for the JDBC Table origins.

5.1.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • Kafka stages include a Custom Authentication (Security Protocol=CUSTOM) option for the Security Option property and related custom security properties that are not yet supported. Do not use the custom security option or specify custom security properties. When defined, custom security properties are ignored.

5.0.x Release Notes

The Transformer 5.0.0 release occurred on May 30, 2022.

New Features and Enhancements

Stage enhancements
  • Snowflake stages property rename and enhancement - The Additional Snowflake Configuration Properties property is now named Connection Properties and is moved from the Advanced tab to the Connection tab. In addition, you can specify credential functions in the property value to retrieve secrets stored in a credential store. This change affects the following stages:
    • Snowflake origin
    • Snowflake Lookup processor
    • Snowflake destination
Transformer logs
Transformer uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Transformer used the Apache Log4j 1.x library, which is now end-of-life.
Proxy server configuration
To configure Transformer to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.
Previously, you configured Transformer to use a proxy server by defining Java configuration options for the deployment and then setting the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable on the Transformer machine.

5.0.0 Fixed Issues

  • You cannot preview or validate a pipeline using embedded Spark libraries.
  • Transformer 4.0.0 or later cannot load runtime resources for a pipeline running on a Hadoop YARN Cloudera distribution cluster.

5.0.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine-tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.3.x Release Notes

The Transformer 4.3.0 release occurred on April 29, 2022.

New Features and Enhancements

Cluster support
New stage
Stage enhancements
  • Amazon S3 origin property rename - The Bucket property is now named Bucket and Path. The property has always accepted a path that includes the asterisk (*) and question mark (?) wildcards.
  • New empty dataframe behavior for JDBC Table origins - When there is no data to be read, the following origins now pass the table schema in an empty dataframe:
    • JDBC Table origin
    • MySQL JDBC Table origin
    • Oracle JDBC Table origin
    • PostgreSQL JDBC Table origin

    In previous releases, these origins passed an empty schema with empty dataframes. This change has no upgrade impact because it includes a new Use Empty Schemas property that passes an empty schema with empty dataframes.

    To preserve backward compatibility, the Use Empty Schemas property is enabled for all upgraded pipelines. For new pipelines, this property is disabled by default.

  • Partition Base Path origin property - The following origins now allow specifying a base path for partitions in a Partition Base Path property:
    • ADLS Gen1 origin
    • ADLS Gen2 origin
    • Amazon S3 origin
    • File origin
    • Google Cloud Storage origin
    • MapR FS origin
  • Skip Empty Batches destination property - The following destinations can now skip writing empty batches when you select the Skip Empty Batches property:
    • ADLS Gen1 destination
    • ADLS Gen2 destination
    • Amazon S3 destination
    • File destination
    • Google Cloud Storage destination
    • MapR FS destination

4.3.0 Fixed Issues

  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.

4.3.x Known Issues

  • You cannot preview or validate a pipeline using embedded Spark libraries.
  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine-tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.2.x Release Notes

The Transformer 4.2.0 release occurred on January 21, 2022.

New Features and Enhancements

Internal update
This release includes internal updates to support an upcoming Control Hub feature on the StreamSets platform.
Note: All new Transformer deployments on the StreamSets platform will use Transformer version 4.2.0 or higher. Existing deployments are not affected.
Clusters
Stage enhancements
  • Amazon S3 destination - When configuring the destination, you can now use the s3 URI scheme, in addition to the s3a scheme. Best practice is to use s3 with EMR clusters and s3a with all other clusters.
  • Field Replacer processor - Use Spark SQL expressions to generate new values for specified fields. You can use quotation marks to specify a string.
  • Google BigQuery origin - The origin can now read from Google BigQuery views.
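The URI scheme guidance for the Amazon S3 destination can be expressed as a small helper. This is a sketch only: the function name and the cluster type strings are hypothetical, and the bucket and key are placeholders.

```python
def s3_uri(bucket: str, key: str, cluster_type: str) -> str:
    """Build an S3 URI with s3 for EMR clusters and s3a for all other clusters."""
    scheme = "s3" if cluster_type.strip().lower() == "emr" else "s3a"
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


if __name__ == "__main__":
    # Placeholder bucket, key, and cluster type values.
    print(s3_uri("example-bucket", "output/sales", "EMR"))
    print(s3_uri("example-bucket", "output/sales", "Databricks"))
```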
Connections
  • Amazon EMR cluster connections include the following enhancements:
    • You can configure Amazon EMR cluster connections to assume another role.
    • You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
Deprecation and testing update
  • The Cloudera CDH 5.x stage libraries are now deprecated. As a result, StreamSets no longer tests Transformer against Cloudera CDH 5.x.
Additional enhancements
  • CyberArk credential store support - You can use CyberArk as a credential store for Transformer.
  • Cluster URL access - When you monitor a Control Hub job for a Databricks pipeline, you can now access the Databricks cluster job URL from the job summary.

4.2.0 Fixed Issues

  • Redshift destinations fail to write partitioned data when running on Databricks cluster versions 7.x and later. The pipeline fails with the following error:
    java.sql.SQLException: Invalid operation: Mandatory url is not present in manifest file.
  • The Scala processor always checks if a batch is empty instead of checking only when the Skip Empty Batches property is enabled. This slows performance.
  • Runtime resources are not accessible from Transformer pipelines.
  • When provisioning a Databricks cluster, the user-defined tags defined in cluster configuration properties are not being set.
  • When provisioning a Databricks cluster, the policy_id parameter defined in the cluster configuration properties is ignored.
  • For pipelines run on Databricks clusters, resources are staged in the EBS volumes instead of the Databricks distributed file system (DBFS), and the resources are not being removed when no longer needed.

4.2.x Known Issues

  • As noted in the StreamSets Technical Service Bulletin, Transformer 3.12.0 and later are not vulnerable to the Apache Log4j zero-day vulnerability documented in CVE-2021-44228.

    However, StreamSets highly recommends that you update all clusters that run Transformer pipelines to protect against the zero-day vulnerability.

  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    
    GC overhead limit exceeded
    
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine-tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.1.x Release Notes

The Transformer 4.1.0 release occurred on September 27, 2021.

New Features and Enhancements

Clusters
  • Amazon EMR:
    • AWS tags - When provisioning an Amazon EMR cluster, you can specify AWS tags for the cluster.
    • Regions - You can now specify additional regions for EMR clusters.
  • Google Dataproc:
    • Dataproc labels - When provisioning a Google Dataproc cluster, you can specify Dataproc labels for the cluster.
    • Credentials files - You can now specify relative paths in addition to absolute paths to service account credentials files.
    • Regions - You can now specify additional regions for Dataproc clusters.
  • Databricks:
    • Job submission - When you start a pipeline, Transformer now submits the pipeline to a Databricks cluster directly as a workload, creating an ephemeral job.
      Previously, Transformer created one-time jobs, which counted against the job limit on the account. Ephemeral jobs do not count towards the job limit.
      Note: The details of ephemeral jobs do not display with regular jobs through the Databricks job menu. For details, see Upgrade Impact.
    • Init script enhancement - When provisioning a Databricks cluster on Azure, you can now use Azure cluster-scoped init scripts stored on Azure Blob File System that are accessible using an ADLS Gen2 storage account.
Stages
  • New JDBC Query origin - Use the JDBC Query origin to read data from database tables with a custom query.
  • JDBC origin renamed - To clarify the difference between this existing origin and the new JDBC Query origin, the JDBC origin is now known as the JDBC Table origin.
Credential stores
Additional enhancements
  • Job functions - You can now use job functions when you configure any pipeline property that allows expressions.
  • Enabling HTTPS for Transformer - You can now store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore and truststore location. This can have upgrade impact.

Upgrade Impact

Java JDK 11 enforcement for Scala 2.12 installations
With this release, Transformer prebuilt with Scala 2.12 enforces the Java JDK 11 requirement. In previous releases, Transformer prebuilt with Scala 2.12 required a Java JDK 11 installation but did not enforce the requirement.
If you upgrade to Transformer 4.1.0 prebuilt with Scala 2.12, you must have Java JDK 11 installed on the Transformer machine for Transformer to start.
Databricks job submission change
With this release, Transformer submits jobs to Databricks differently from previous releases.
In previous releases, with each pipeline run, Transformer created a standard Databricks job but used it only once. This job counted toward the Databricks jobs limit.
With this release, Transformer submits ephemeral jobs to Databricks. An ephemeral job runs only once, and does not count towards the Databricks job limit. However, the details for the ephemeral jobs are only available for 60 days, and are not available through the Databricks jobs menu. For information about accessing job details, see Accessing Databricks Job Details.
HDInsight pipelines with ADLS stages
With this release, when you include an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for HDInsight cluster, the stage must use the ADLS cluster-provided libraries stage library.
Enabling HTTPS for Transformer
With this release, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
In previous releases, you could store the keystore and truststore files in the Transformer configuration directory, <installation_dir>/etc, and then define the location of each file using a path relative to that directory. You can continue to store the files in the configuration directory, but StreamSets recommends moving them to the resources directory when you upgrade.
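
The two locations described above can be sketched as a small path helper. This is an illustration only: the function name candidate_store_paths and the example installation directory are hypothetical, and the order of the returned list reflects the recommendation in these notes, not a documented lookup order in Transformer.

```python
from pathlib import PurePosixPath


def candidate_store_paths(installation_dir, configured_path):
    """List the locations a configured keystore or truststore path could refer to.

    Hypothetical helper: it only mirrors the two directories named in the
    release notes and does not reproduce Transformer's actual lookup logic.
    """
    configured = PurePosixPath(configured_path)
    if configured.is_absolute():
        # An absolute path is used as-is.
        return [configured]
    base = PurePosixPath(installation_dir)
    return [
        # Recommended location as of this release.
        base / "externalResources" / "resources" / configured,
        # Location used in previous releases, still allowed.
        base / "etc" / configured,
    ]


if __name__ == "__main__":
    for path in candidate_store_paths("/opt/transformer", "keystore.jks"):
        print(path)
```

When you upgrade, moving a file from the second location to the first keeps the same relative path in the configuration properties valid.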

4.1.0 Fixed Issues

  • Pipelines with ADLS stages that run on Azure HDInsight 4.0 clusters with Transformer built for Spark 2.4 fail to start. This fix might cause upgrade impact.
  • Transformer does not enforce the Java JDK 11 requirement for Transformer prebuilt with Scala 2.12. This fix might cause upgrade impact.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, the job can hang in a failover Transformer in a STARTING state when the Spark job completes before the failover Transformer fully takes over the Control Hub job.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.

4.1.x Known Issues

  • When provisioning a Databricks cluster, user-defined tags defined in cluster configuration properties are not set.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.0.x Release Notes

The Transformer 4.0.0 release occurred on June 21, 2021.

New Features and Enhancements

Spark 3 and Scala 2.12 support

Transformer supports using Spark 3.0 and Scala 2.12 for some cluster types. As a result, StreamSets now provides different installation packages for Transformer.

For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.

Stages
  • New Amazon Redshift origin - Use the Amazon Redshift origin to read data from an Amazon Redshift table.
Clusters
  • Amazon EMR enhancements:
    • Additional EMR support - You can run pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Bootstrap actions support - When you provision a cluster, you can define bootstrap actions scripts in cluster configuration properties or you can use bootstrap actions scripts stored on Amazon S3.
  • Databricks clusters:
    • Additional Databricks support - You can run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Cluster-scoped init script support - When you provision a cluster, you can define cluster-scoped init scripts in cluster configuration properties. You can also use cluster-scoped init scripts stored on DBFS or S3. Specifying a location on Azure is not available at this time.
    • Databricks failover support - You can configure pipeline failover for Databricks pipelines.
  • Application Name enhancement - When specifying an application name for a cluster, you can now use underscores in addition to alphanumeric characters.
Connections

With this release, the following stages support using connections:

  • MySQL JDBC Table origin
  • Oracle JDBC Table origin
  • PostgreSQL JDBC Table origin
  • SQL Server JDBC Table origin
  • Amazon Redshift origin and destination - Available after the Data Collector 4.1.0 release.
Additional enhancements
  • TRANSFORMER_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as external libraries and runtime resources.

    The default location is $TRANSFORMER_DIST/externalResources.
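
As a rough sketch of how the default above applies, the following helper resolves the external resources directory from environment variables. The function name external_resources_dir is hypothetical, and the code assumes that TRANSFORMER_DIST is set whenever TRANSFORMER_EXTERNAL_RESOURCES is not. It is not Transformer's own startup logic.

```python
import os


def external_resources_dir(env=None):
    """Return the external resources root directory.

    Hypothetical helper: uses TRANSFORMER_EXTERNAL_RESOURCES when it is set,
    and otherwise falls back to the documented default,
    $TRANSFORMER_DIST/externalResources.
    """
    env = os.environ if env is None else env
    override = env.get("TRANSFORMER_EXTERNAL_RESOURCES")
    if override:
        return override
    dist = env.get("TRANSFORMER_DIST")
    if not dist:
        raise KeyError("TRANSFORMER_DIST is not set")
    return dist.rstrip("/") + "/externalResources"


if __name__ == "__main__":
    # Example with an explicit environment mapping instead of os.environ.
    print(external_resources_dir({"TRANSFORMER_DIST": "/opt/transformer"}))
```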

4.0.0 Fixed Issues

  • When you force stop an EMR pipeline, the Spark job on EMR continues to run until the last batch is written.

    With this fix, when you force stop an EMR pipeline, Transformer first tries to stop the Spark job through the YARN service in the cluster. If the YARN service is not reachable, Transformer sends a new step to the EMR cluster with the stop command.

    As a result, if the YARN service is not reachable, Transformer can only force stop the pipeline when all of the following are true:
    • The pipeline runs on EMR 5.28 or later with support for step concurrency.
    • The Step Concurrency property in the pipeline is set to 2 or higher.
    • A step becomes available.
  • When a Databricks pipeline successfully completes, Transformer indicates that it has finished running. However, on the Databricks cluster, the Spark job seems to be cancelled instead.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on EMR clusters.
  • Upgrading pipelines with the Amazon S3 destination created on Transformer 3.15.0 or earlier to Transformer 3.16.x - 3.18.x can generate errors related to the Partition by Fields stage property.

  • Errors occur when using the Amazon S3 origin and destination in the same pipeline when reading from and writing to different regions.
  • The Field Renamer processor does not rename fields for empty batches.
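
The force stop behavior for EMR pipelines described in the first fixed issue above can be summarized as a decision function. This is a minimal sketch of the stated conditions only: the function name, parameter names, and return values are hypothetical and do not correspond to any Transformer API.

```python
def force_stop_method(yarn_reachable, emr_version, step_concurrency, step_available):
    """Pick how an EMR pipeline could be force stopped, per the conditions above.

    emr_version is a (major, minor) tuple, for example (5, 28).
    Returns "yarn", "emr_step", or None when a force stop is not possible.
    """
    if yarn_reachable:
        # First choice: stop the Spark job through the YARN service.
        return "yarn"
    # Fallback: send a new step with the stop command, which requires
    # step concurrency support (EMR 5.28 or later), a Step Concurrency
    # value of 2 or higher, and an available step.
    supports_concurrency = tuple(emr_version) >= (5, 28)
    if supports_concurrency and step_concurrency >= 2 and step_available:
        return "emr_step"
    return None


if __name__ == "__main__":
    print(force_stop_method(False, (5, 30), 2, True))
```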

4.0.x Known Issues

  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, and the Spark job completes before a failover Transformer fully takes over the job, the Control Hub job can hang in the failover Transformer in a STARTING state with the following error:

    CONTAINER_0102 - Cannot change state from STARTING to FINISHING

    Workaround: To correctly finish the Control Hub job, use Control Hub to force stop the job and wait until the job reaches an INACTIVE_ERROR state. Then, acknowledge the error.

  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.