Release Notes
5.9.x Release Notes
The Transformer 5.9.0 release occurred on October 29, 2024.
New Features and Enhancements
- Dataproc support
- You can run pipelines on Dataproc 2.2 clusters.
- HSTS support
- When you configure Transformer to use HTTPS, you can enable Transformer to provide an HTTP Strict Transport Security (HSTS) response header.
- Snowflake stages
- Snowflake stages include Transformer-provided stage libraries for Spark 3.5.
- Docker image upgrade
- The Docker image for Transformer 5.9.0, streamsets/transformer:scala-2.12_5.9.0, now uses Spark 3.5.2, upgraded from Spark 3.4.1.
- Deprecated support
- Spark 2.x support is deprecated with this release.
- Deprecated functionality
- The Advanced Error Handling pipeline property is also deprecated and will be removed in a future release. After the property is removed, the JDBC Table origin will no longer include the SQL query and results in the Transformer log.
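As a rough illustration of the HSTS support noted above, the following Python sketch parses a Strict-Transport-Security header value such as a client might receive from a Transformer instance served over HTTPS. The header name and the max-age and includeSubDomains directives come from RFC 6797. The sample value is hypothetical, and these notes do not state which directives Transformer actually sends.

```python
def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a dict.

    Directive names are case-insensitive per RFC 6797, and values may be quoted.
    """
    directives = {}
    for part in header_value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    return directives


# Hypothetical header value, for illustration only.
policy = parse_hsts("max-age=31536000; includeSubDomains")
max_age_seconds = int(policy["max-age"])
applies_to_subdomains = "includesubdomains" in policy
```

A check like this can help confirm that the header is present after you enable the option, without relying on browser behavior.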
5.9.x Known Issues
- If you restart Transformer, then force stop a pipeline that runs on a Spark Standalone cluster or a MapR cluster with security enabled, Transformer can indicate that the pipeline has been stopped even though the pipeline continues to run.
Workaround: Use Spark or YARN monitoring tools to track and manage those pipelines.
- When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
preferred_min_stream_count must be less than or equal to max_stream_count
Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
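The BigQuery Max Readers issue listed above can be turned into a simple pre-flight check. The Python sketch below is only an illustration: the function names are hypothetical, and the ranges are taken directly from the issue description and its workaround. The notes do not say how a value of exactly 6 behaves, so the sketch leaves it unclassified.

```python
def max_readers_is_risky(max_readers):
    """Return True if the value falls in the range the known issue describes.

    The notes report failures for values from 1 through 5 on Dataproc 2.1.
    """
    return 1 <= max_readers <= 5


def max_readers_matches_workaround(max_readers):
    """Return True if the value matches the documented workaround.

    The workaround suggests 0 or a value greater than 6.
    """
    return max_readers == 0 or max_readers > 6
```

A deployment script could call the first function to flag BigQuery origins on Dataproc 2.1 clusters that need a second look before the pipeline runs.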
5.8.x Release Notes
The Transformer 5.8.0 release occurred on July 3, 2024.
New Features and Enhancements
- Clusters
- You can run pipelines on the following cluster versions:
- Databricks 14.3
- EMR 7.0.0
- EMR Serverless 7.0.0
- Hadoop YARN, Cloudera version CDP Private Cloud Base 7.1.x with Spark 3.x
- You can configure an EMR pipeline to terminate a cluster provisioned by AWS Service Catalog after the pipeline stops.
- Snowflake stages
- The Snowflake origin, Snowflake Lookup processor, and Snowflake destination include the following enhancements:
- The following properties are available when you configure a Snowflake stage to use a Control Hub Snowflake connection. You can use them to override the value in the related connection property:
- Override Warehouse
- Override Database
- Override Schema
- The following property is available when you configure connection information in the Snowflake stage. Use it to enable using a private link URL:
- Use Private Link Snowflake URL
- Snowflake stages include Transformer-provided stage libraries for Spark 3.0 - 3.4. Previously, Transformer-provided stage libraries were for Spark 3.1. As a result, Snowflake stages in existing pipelines are updated to use the Transformer-provided stage library for Spark 3.1.
- Kudu support
- You can now use Kudu stages in pipelines that run on Spark 3.x clusters.
- Pipeline caching
- Pipelines include the following new advanced properties:
- Cache Level - Determines how or where Spark caches data during a pipeline run.
- Cache Replicas - Determines how many replicas of cached data to create.
This enhancement does not affect existing pipelines. Existing pipelines have the Cache Level property set to Memory and Disk to ensure the previous behavior.
- Azure Data Lake Storage Gen1 no longer supported
- Microsoft retired Azure Data Lake Storage Gen1 in February 2024. As a result, this release includes the following changes:
- ADLS Gen1 origin and destination - These stages, which were deprecated in an earlier version, have been removed.
- Delta Lake origin, processor, and destination - These stages no longer support processing data stored on Azure Data Lake Storage Gen1.
- Product rename
- Following the IBM acquisition of StreamSets, Transformer is part of what is now known as IBM StreamSets for Apache Spark.
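The Snowflake override properties described above (Override Warehouse, Override Database, and Override Schema) replace the corresponding value from the Control Hub connection. The Python sketch below models that resolution. The dictionary keys and sample names are hypothetical, and treating an empty override as "not set" is an assumption that these notes do not confirm.

```python
def effective_snowflake_settings(connection, overrides):
    """Merge stage-level overrides onto connection values.

    An override that is missing, None, or an empty string leaves the
    connection value in place (assumed behavior).
    """
    keys = ("warehouse", "database", "schema")
    return {key: overrides.get(key) or connection[key] for key in keys}


# Hypothetical connection values and a single stage-level override.
connection = {"warehouse": "LOAD_WH", "database": "SALES", "schema": "PUBLIC"}
overrides = {"database": "SALES_DEV"}
settings = effective_snowflake_settings(connection, overrides)
```

In this example only the database changes, so one shared connection can serve stages that target different databases.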
5.8.x Known Issues
- If you restart Transformer, then force stop a pipeline that runs on a Spark Standalone cluster or a MapR cluster with security enabled, Transformer can indicate that the pipeline has been stopped even though the pipeline continues to run.
Workaround: Use Spark or YARN monitoring tools to track and manage those pipelines.
- When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
preferred_min_stream_count must be less than or equal to max_stream_count
Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
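For the Conscrypt workaround above, the new-cluster option amounts to passing one cluster property at provisioning time. The Python sketch below builds a gcloud command line that sets it through the --properties flag. The cluster name and region are hypothetical, and the sketch covers only clusters that you create yourself with gcloud, not clusters that Transformer provisions.

```python
import shlex

# Property named in the workaround, set to false to disable Conscrypt.
CONSCRYPT_PROPERTY = "dataproc:dataproc.conscrypt.provider.enable=false"


def dataproc_create_command(cluster_name, region):
    """Build a gcloud command that provisions a cluster with Conscrypt disabled."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--properties", CONSCRYPT_PROPERTY,
    ]


# Hypothetical cluster name and region.
command = dataproc_create_command("example-cluster", "us-central1")
print(shlex.join(command))
```

Building the command as a list keeps the arguments unambiguous if the result is later passed to subprocess.run.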
5.7.x Release Notes
The Transformer 5.7.0 release occurred on February 26, 2024.
New Features and Enhancements
- Clusters
- Databricks 13.3 support - You can now run pipelines on Databricks 13.3 clusters.
- External libraries on Databricks, Dataproc, EMR, and EMR Serverless clusters - Transformer fully manages staging directories for these clusters. When you add, remove, or update external libraries as external resources on Transformer, Transformer now automatically updates the cluster staging directories when you run a related pipeline.
Previously, Transformer only updated these cluster staging directories when you added external libraries or updated external libraries with the same base file name. Now, Transformer compares the external libraries installed on Transformer against cluster staging directories and updates these directories to match, as needed.
- EMR clusters
- AWS Service Catalog support - You can configure an EMR pipeline to run on a cluster provisioned by AWS Service Catalog.
- Define Cluster Start Option property - Use this property to specify whether the pipeline runs on an existing cluster, a cluster provisioned by Transformer, or a cluster provisioned by AWS Service Catalog.
In previous releases, you used the Provision a New Cluster property to specify whether the pipeline ran on an existing cluster or a cluster provisioned by Transformer. This update does not require changes to existing pipelines.
- Connections
- Amazon EMR Cluster connections include the same changes as EMR clusters:
- You can configure a connection to run a job on a cluster provisioned by AWS Service Catalog.
- The new Define Cluster Start Option property allows you to specify whether the pipeline runs on an existing cluster, a cluster provisioned by Transformer, or a cluster provisioned by AWS Service Catalog.
Previously, you used the Provision a New Cluster property to specify whether the pipeline ran on an existing cluster or a cluster provisioned by Transformer. This update does not require changes to existing pipelines.
- Transformer driver callback URL
- Transformer includes a new driver callback URL property, transformer.driver.callback.url. This property defines the cluster callback URL for Spark to communicate with Transformer. Transformer uses the specified callback URL for all pipelines, unless overridden by the existing Cluster Callback URL property defined in individual pipelines.
In previous releases, the specified Transformer base URL, transformer.base.http.url, which defines how Control Hub communicates with Transformer, also acted as the cluster callback URL.
This enhancement is not a behavior change. However, if you previously configured the Cluster Callback URL property in pipelines as a workaround to avoid using the Transformer base URL as the cluster callback URL, you can now define the new transformer.driver.callback.url property instead.
Note that the Cluster Callback URL pipeline property, when defined, takes precedence over all other possible URLs. For more information, see Understanding the Spark Cluster Callback URL.
- Stages and libraries
- Library support:
- Delta Lake 2.4.0 support - Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x. This change can have upgrade impact.
- Amazon support for Hadoop 3.3.4 - Amazon S3 stages include Transformer-provided AWS libraries for Hadoop 3.3.4.
- Deprecated stages - The ADLS Gen1 origin and destination have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these stages. For suggested alternatives, see Deprecated Functionality.
- Additional updates
- Docker image upgrades - The Docker images for Transformer 5.7.0 include upgraded Spark versions, as follows:
- streamsets/transformer:scala-2.11_5.7.0 now uses Spark 2.4.8, upgraded from Spark 2.4.5.
- streamsets/transformer:scala-2.12_5.7.0 now uses Spark 3.4.1, upgraded from Spark 3.0.1.
- Ludicrous input metrics - When a pipeline runs in Ludicrous processing mode, Transformer provides both input and output statistics.
In previous releases, Transformer provided only output statistics by default, but you could enable the generation of input statistics as needed. With this update, the Collect Input Metrics pipeline property has been removed since input metrics are now always generated in Ludicrous mode.
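The precedence among callback URLs described under "Transformer driver callback URL" above can be sketched as a small resolution function. This is an illustration in Python, not Transformer code: the function name and sample URLs are hypothetical, and the final fallback to transformer.base.http.url is an assumption inferred from the statement that the enhancement is not a behavior change.

```python
def resolve_callback_url(pipeline_callback_url, driver_callback_url, base_http_url):
    """Pick the cluster callback URL in the order the notes describe.

    1. Cluster Callback URL pipeline property, when defined.
    2. transformer.driver.callback.url, when defined.
    3. transformer.base.http.url (assumed fallback, matching earlier releases).
    """
    for candidate in (pipeline_callback_url, driver_callback_url, base_http_url):
        if candidate:
            return candidate
    raise ValueError("no callback URL is configured")


# Hypothetical URLs, for illustration only.
url = resolve_callback_url(
    pipeline_callback_url=None,
    driver_callback_url="https://transformer.internal.example.com:19630",
    base_http_url="https://transformer.example.com:19630",
)
```

Here the pipeline does not define its own Cluster Callback URL, so the engine-level driver callback URL is used.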
Upgrade Impact
- Delta Lake stages that use Transformer-provided libraries
- With this release, the Transformer-provided libraries for Delta Lake stages have been upgraded from Delta Lake version 0.7.0 to version 2.4.0, which supports Spark 3.4.x. Review pipelines that include Delta Lake stages that use Transformer-provided libraries to ensure that the Spark upgrade does not adversely affect data processing.
5.7.0 Fixed Issues
- When you restart Transformer, Transformer does not correctly report the status of pipelines that have been running on MapR and CDH clusters, which can lead to the incorrect belief that the pipelines have stopped. With this fix, Transformer correctly reconnects to MapR and CDH clusters and provides accurate pipeline status reports.
- Transformer does not provide access to Spark driver logs for pipelines that run on EMR Serverless applications, existing EMR clusters, or provisioned EMR clusters that store the driver logs outside of Amazon S3.
5.7.x Known Issues
- If you restart Transformer, then force stop a pipeline that runs on a Spark Standalone cluster or a MapR cluster with security enabled, Transformer can indicate that the pipeline has been stopped even though the pipeline continues to run.
Workaround: Use Spark or YARN monitoring tools to track and manage those pipelines.
- When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
preferred_min_stream_count must be less than or equal to max_stream_count
Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.6.x Release Notes
The Transformer 5.6.0 release occurred on October 26, 2023.
New Features and Enhancements
- Assume role external ID support
- Amazon S3 stages, EMR pipelines, and EMR Serverless pipelines allow you to specify an external ID to use when assuming another role.
- Azure SQL destination
- The destination is available for use with Spark 3.0.x and later, except for version 3.2.x.
- The destination includes the following new properties for performing a bulk copy:
- Reliability Level
- Isolation Level
- Schema Check
- The Auto-Create Table property, which was available when performing a bulk copy, has been removed. It was unnecessary because the destination always creates a table as needed when performing a bulk copy.
- Transformer now includes a Microsoft JDBC driver for SQL Server with the Azure SQL destination. Starting with this release, the destination requires version 8 or later of the driver. This change can have upgrade impact.
- Connections
- EMR and EMR Serverless connections also allow you to specify an external ID to use when assuming another role.
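The external ID mentioned above corresponds to the ExternalId parameter of the AWS STS AssumeRole request. The Python sketch below builds the request parameters. RoleArn, RoleSessionName, and ExternalId are AssumeRole parameter names. The helper function, role ARN, session name, and external ID value are hypothetical, and these notes do not describe how Transformer itself issues the request.

```python
def assume_role_params(role_arn, session_name, external_id=None):
    """Build the parameters for an AWS STS AssumeRole request.

    ExternalId is included only when a value is supplied.
    """
    params = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id:
        params["ExternalId"] = external_id
    return params


params = assume_role_params(
    "arn:aws:iam::123456789012:role/example-role",  # hypothetical role ARN
    "transformer-session",  # hypothetical session name
    external_id="example-external-id",  # hypothetical external ID
)
```

With the boto3 library, a dictionary like this could be passed as keyword arguments to the STS client's assume_role call.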
Upgrade Impact
- Update older Microsoft drivers for Azure SQL pipelines
- Starting with 5.6.0, the Azure SQL destination requires a Microsoft JDBC driver for SQL Server version 8 or later. Transformer also includes a Microsoft JDBC driver for SQL Server with the destination, starting with this release.
5.6.0 Fixed Issues
- Runtime parameters and runtime properties are not correctly evaluated when used to specify the bucket and path in Amazon S3 destinations.
- Long-running pipelines can cause memory issues with Transformer.
- Sometimes, Force Stop does not kill pipelines that run on Hadoop YARN clusters.
- The Partition Columns property in file-based origins includes an empty partition column option by default.
5.6.x Known Issues
- When used in a Dataproc 2.1 cluster, the BigQuery origin can cause the pipeline to fail with the following error when the Max Readers property is set to a value between 1 and 5, inclusive:
preferred_min_stream_count must be less than or equal to max_stream_count
Workaround: When possible, set the Max Readers property to 0 or a value greater than 6.
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.5.x Release Notes
The Transformer 5.5.0 release occurred on July 11, 2023.
New Features and Enhancements
- Clusters
- Databricks 12.2 support - You can run pipelines on a Databricks 12.2 cluster.
- Dataproc 2.1 support - You can run pipelines on a Google Dataproc 2.1 cluster.
- Stages
- Unity Catalog origin - Use the new origin to read from a Databricks Unity Catalog managed table.
- Unity Catalog destination - Use the new destination to write to a Databricks Unity Catalog managed or external table.
- Pipelines
- Pipeline retry properties - Use the following properties to specify how Transformer tries to start pipelines that fail to start:
- Retry Pipeline on Error - Enables trying to start a pipeline again after it fails to start.
- Retry Attempts - Number of times to try to start a pipeline that fails to start. The default is 3.
- Field selection - When a property includes a magnifying glass icon, you can use preview data to select fields instead of typing field names:
- After previewing data, stages provide autocomplete suggestions for field names in properties that accept multiple fields.
- After selecting fields, you can drag the fields to change the order in which they appear.
- In some stages, after you select one or more fields to use, you can then enter an expression that includes the field name or enter additional field names.
- Deprecation and testing update
- Due to end-of-life declarations, the following cluster versions are deprecated with this release and are no longer tested with Transformer:
- Azure HDInsight 4.0
- CDH 6.x
- Databricks 5.x - 8.x
- HDP
- SQL Server 2019 Big Data Clusters
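The pipeline retry properties introduced in this release (Retry Pipeline on Error and Retry Attempts, default 3) describe a bounded retry of the pipeline start. The Python sketch below models that behavior. It assumes that Retry Attempts counts retries in addition to the initial start, which these notes do not confirm, and the function names and the simulated failure are hypothetical.

```python
def start_with_retries(start_pipeline, retry_on_error=True, retry_attempts=3):
    """Call start_pipeline, retrying after failures up to retry_attempts times.

    Returns the number of start attempts made. Re-raises the last error when
    every attempt fails.
    """
    max_starts = 1 + (retry_attempts if retry_on_error else 0)
    for attempt in range(1, max_starts + 1):
        try:
            start_pipeline()
            return attempt
        except RuntimeError:
            if attempt == max_starts:
                raise


failures_left = [2]  # hypothetical: the first two starts fail


def flaky_start():
    """Simulate a pipeline start that fails until failures_left reaches zero."""
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise RuntimeError("pipeline failed to start")


attempts_used = start_with_retries(flaky_start)
```

With retries disabled, the same function makes a single start attempt and surfaces the first error immediately.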
Upgrade Impact
- Review lookup processor pipelines that sort columns and return the first matching row
- In previous releases, when a lookup processor performed lookups on large data sets, the lookup could fail. This issue is fixed with this release, but the fix requires lookup processors to use an internal row_number column.
5.5.0 Fixed Issues
- Transformer logs include sensitive information in Debug mode.
- Lookups on large data sets can fail.
- The Surrogate Key Generator processor loses track of the last-saved offset after processing an empty batch.
- Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.
5.5.x Known Issues
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.4.x Release Notes
The Transformer 5.4.0 release occurred on April 27, 2023.
New Features and Enhancements
- Clusters
- Amazon EMR - You can specify a cluster using the cluster name and tags, instead of using the cluster ID.
- Amazon EMR Serverless - You can specify an application using the application name and tags, instead of using the application ID.
- Amazon EMR and EMR Serverless - You can configure the following Transformer configuration properties to define how Transformer retries monitoring EMR and EMR Serverless pipelines:
- transformer.emr.monitoring.max.retry
- transformer.emr.monitoring.retry.base.backoff
- transformer.emr.monitoring.retry.max.backoff
- Databricks enhancements:
- You can configure the transformer.databricks.external.resources.cache Transformer configuration property to cache runtime resource files for reuse.
- You can configure the following Transformer configuration properties to define how Transformer retries starting Databricks pipelines:
- transformer.databricks.run.max.retries
- transformer.databricks.run.retry.interval
- When needed, you can configure Transformer to lock the Databricks workspace when it starts a pipeline. This can prevent timeout errors when multiple Transformer engines try to start pipelines that require uploading a large volume of runtime resource files to staging areas.
- Amazon EMR, Amazon EMR Serverless, Databricks, and Dataproc clusters - When you update an external library used by these clusters for Transformer, Transformer can update the external libraries in the cluster staging directories, so you do not need to manually remove older versions.
- Connections
- Amazon EMR - You can specify a cluster using the cluster name and tags, instead of using the cluster ID. Available with authoring Data Collector 5.5.0 or later.
- Amazon EMR Serverless - You can specify an application using the application name and tags, instead of using the application ID. Available with authoring Data Collector 5.5.0 or later.
- Additional enhancements
- Transformer Docker image - The Docker images for Transformer 5.4.0, streamsets/transformer:scala-2.11_5.4.0 and streamsets/transformer:scala-2.12_5.4.0, use Ubuntu 22.04 LTS (Jammy Jellyfish) as a parent image. This change can have upgrade impact.
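The three EMR monitoring retry properties listed above (a maximum retry count, a base backoff, and a maximum backoff) suggest a capped backoff schedule. The Python sketch below shows one common interpretation, exponential doubling with a cap. The doubling strategy, the sample values, and the absence of units are all assumptions, since these notes name the properties but do not describe the algorithm.

```python
def backoff_delays(max_retry, base_backoff, max_backoff):
    """Return the delay before each retry: base * 2**n, capped at max_backoff."""
    return [min(base_backoff * (2 ** n), max_backoff) for n in range(max_retry)]


# Hypothetical values standing in for:
#   transformer.emr.monitoring.max.retry
#   transformer.emr.monitoring.retry.base.backoff
#   transformer.emr.monitoring.retry.max.backoff
delays = backoff_delays(max_retry=5, base_backoff=100, max_backoff=1000)
```

Under these assumptions the delays grow quickly at first and then level off at the cap, which bounds the total time spent retrying.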
Upgrade Impact
- Review Dockerfiles for custom Docker images
In previous releases, Transformer Docker images used Alpine Linux as a parent image. Due to limitations in Alpine Linux, with this release Transformer Docker images use Ubuntu 22.04 LTS (Jammy Jellyfish) as the parent image.
If you build custom Transformer images with earlier releases of streamsets/transformer as the parent image, review your Dockerfiles. Make all required updates so they are compatible with Ubuntu Jammy Jellyfish before you build a custom image based on streamsets/transformer:scala-2.11_5.4.0 or streamsets/transformer:scala-2.12_5.4.0.
5.4.0 Fixed Issues
- When returning an empty dataframe, some processors, such as the Filter processor, do not include the schema as expected.
- The Surrogate Key Generator processor restarts key generation using the configured initial value after receiving an empty dataframe.
- When a provisioned pipeline completes successfully on Databricks, Databricks can incorrectly display a Cancelled or Failed status instead of a Succeeded status.
5.4.x Known Issues
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- Transformer cannot locate a separate runtime properties file that has been uploaded as an
external resource for the engine.
Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.
- The Snowflake destination fails when attempting to create a new table and the
destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google
Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
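For the existing-cluster method of the Conscrypt workaround, the edit to java.security could be scripted roughly as follows. This is a sketch under assumptions: it assumes Conscrypt is registered through a `security.provider.N=` entry and that the entries appear in ascending order. It comments out the Conscrypt entry and renumbers the remaining entries so that no gap is left in the sequence. The function name `disable_conscrypt` is hypothetical. Back up the file on each node before changing it.

```python
import re

# Matches entries such as "security.provider.1=org.conscrypt.OpenSSLProvider".
PROVIDER_LINE = re.compile(r"^security\.provider\.(\d+)=(.*)$")


def disable_conscrypt(java_security_text):
    """Comment out the Conscrypt provider entry and renumber the other providers."""
    output = []
    next_index = 1
    for line in java_security_text.splitlines():
        match = PROVIDER_LINE.match(line.strip())
        if not match:
            output.append(line)  # not a provider entry: keep unchanged
            continue
        if "org.conscrypt.OpenSSLProvider" in match.group(2):
            output.append("#" + line)  # keep the original entry as a comment
            continue
        # Reassign consecutive numbers to the providers that remain.
        output.append(f"security.provider.{next_index}={match.group(2)}")
        next_index += 1
    return "\n".join(output) + "\n"


# Illustrative input; the real file contains many more settings.
sample = (
    "security.provider.1=org.conscrypt.OpenSSLProvider\n"
    "security.provider.2=sun.security.provider.Sun\n"
    "security.provider.3=sun.security.rsa.SunRsaSign\n"
)
print(disable_conscrypt(sample))
```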
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.3.x Release Notes
The Transformer 5.3.0 release occurred on February 17, 2023.
New Features and Enhancements
- Clusters
-
- Amazon EMR Serverless - Transformer can run pipelines on an Amazon EMR Serverless application.
- Databricks clusters
- You can include the Google Big Query origin and destination and the Google Cloud Storage origin and destination in pipelines that you run on a Databricks cluster.
- You can run pipelines on a Databricks 11.3 cluster.
- Hadoop YARN clusters - You can run pipelines on a MapR 7.0 distribution of Hadoop YARN. The distribution requires Ezmeral Ecosystem Pack (EEP) 8.1, which includes Spark 3.2.0. MapR is now called HPE Ezmeral Data Fabric.
- Stages
-
- Data Formats property - The Whole Directory and MapR FS origins have an Additional Data Format Configuration property on the Data Formats tab. You can use this property to enter other data format parameters.
- Google stages - You can include the Google Big Query origin and destination and the Google Cloud Storage origin and destination in pipelines that you run on a Databricks cluster.
- Slowly Changing
Dimension processor - Improved logic produces consistent
and expected results in the following existing features:
- Null handling - When enabled, the processor replaces null values in change records.
- Records without changes - The processor discards any change record that does not contain a change.
- Tracking fields - Change records do not require tracking fields.
- Timestamp basis field - The processor supports three
categories for the timestamp basis field:
- Same name as tracking field
- Data field in master dimension
- Extra field in change record
- Non-Type 2 updates - In a Type 2 dimension, a change record can trigger a non-Type 2 update.
- New records - For change records without a matching master record, the processor inserts a new record.
- Connections
-
- Amazon EMR Serverless - You can use an Amazon EMR Serverless connection when configuring a pipeline to run on an Amazon EMR Serverless application.
- Additional enhancements
-
- Thycotic Secret Server credential store - You can use Thycotic Secret Server as a credential store for Transformer.
- Support bundles - You can
generate a support bundle when Transformer uses the default
WebSocket tunneling communication method.
Earlier Transformer versions require using direct engine REST APIs to generate support bundles.
- Documentation terminology
-
- In the documentation that discusses slowly changing file dimensions, the terms “grouped file dimension” and “ungrouped file dimension” replace the terms “partitioned file dimension” and “unpartitioned file dimension.”
Upgrade Impact
- Review account types
- With this release, Transformer no longer supports StreamSets accounts. If you were using a StreamSets account with Transformer, switch to a different account type.
- Manage underscores in Snowflake connection information
- Starting with the Snowflake JDBC driver 3.13.25
release in November 2022, the Snowflake JDBC driver converts underscores to hyphens,
by default.
This can adversely affect communication with Snowflake when the connection information specified in a Snowflake stage or connection, such as a URL, includes underscores. When needed, you can bypass this behavior by setting the allowUnderscoresInHost driver property to true. For more information and alternate solutions, see this Snowflake community article.
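To decide whether the allowUnderscoresInHost property is needed, you can check whether the host portion of the Snowflake URL contains an underscore. The following sketch shows one way to do that. The function names and the example URLs are hypothetical. Only the property name and its value come from the note above.

```python
from urllib.parse import urlparse


def host_has_underscore(snowflake_url):
    """Return True when the host part of a Snowflake URL contains an underscore."""
    # Strip a leading "jdbc:" so that urlparse sees a regular scheme://host URL.
    if snowflake_url.startswith("jdbc:"):
        snowflake_url = snowflake_url[len("jdbc:"):]
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in snowflake_url:
        snowflake_url = "https://" + snowflake_url
    host = urlparse(snowflake_url).hostname or ""
    return "_" in host


def extra_driver_properties(snowflake_url):
    """Return the driver property to add when the host contains an underscore."""
    if host_has_underscore(snowflake_url):
        return {"allowUnderscoresInHost": "true"}
    return {}


print(extra_driver_properties("my_org-my_account.snowflakecomputing.com"))
print(extra_driver_properties("https://myorg-myaccount.snowflakecomputing.com/?warehouse=MY_WH"))
```

Underscores in query parameters, such as a warehouse name, do not count, because the driver behavior described above concerns the host name only.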
5.3.0 Fixed Issues
- In a PySpark processor, you cannot create views from inputs and then run queries on those views.
- If a stage contains a property that can use batch functions, such as the Directory Path property of the File destination, and that stage references a pipeline parameter, then evaluation of the pipeline parameter fails.
- The Delta Lake destination generates an error when using multiple merge keys in an upsert operation.
5.3.x Known Issues
- Pipelines that run on an EMR Serverless application and contain a Snowflake Lookup processor generate a validation error.
- In Databricks 11.3 clusters, Transformer does not support Oracle 19 databases in JDBC stages.
- Transformer cannot locate a separate runtime properties file that has been uploaded as an
external resource for the engine.
Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.
- The Snowflake destination fails when attempting to create a new table and the
destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google
Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.2.x Release Notes
The Transformer 5.2.0 release occurred on October 27, 2022.
New Features and Enhancements
- New stage
- XML Parser processor - You can use the new XML Parser processor to parse an XML document in a string field and pass the parsed data to a map field.
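Conceptually, parsing an XML document held in a string field into a map field resembles the following plain Python sketch. It only illustrates the idea and is not the processor's implementation. The record layout, field names, and function names are assumptions, and repeated child tags overwrite one another in this simplified version.

```python
import xml.etree.ElementTree as ET


def element_to_map(element):
    """Convert an XML element into nested dicts; leaf elements become their text."""
    children = list(element)
    if not children:
        return element.text
    return {child.tag: element_to_map(child) for child in children}


def parse_xml_field(record, field, target):
    """Return a copy of the record with the XML string in `field` parsed into `target`."""
    parsed = dict(record)
    parsed[target] = element_to_map(ET.fromstring(record[field]))
    return parsed


# Hypothetical record with an XML document stored as a string.
record = {"id": 1, "payload": "<order><id>7</id><item><sku>A1</sku></item></order>"}
print(parse_xml_field(record, "payload", "order"))
```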
- Database table origins
- The MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and
SQL Server JDBC Table
origins include the following enhancements:
- Support for multiple tables - These origins can read from multiple tables. The Table property in previous releases has become the Tables property. With one of these origins configured to read from multiple tables, a pipeline processes multiple batches, in both batch and streaming mode.
- Batch headers - Pipelines with these origins generate batch headers
for each batch and the origin writes the
jdbc.table
attribute in the header. The attribute stores the name of the table that the origin reads for the batch.
- Other stage enhancements
-
- Data Formats property - The ADLS Gen1, ADLS Gen2, Amazon S3, File, and Google Cloud Storage origins have an Additional Data Format Configuration property on the Data Formats tab. You can use this property to enter other data format parameters.
- File-based destinations - The ADLS Gen1, ADLS Gen2, Amazon S3, File, and Google Cloud Storage destinations generate a validation error when writing in the Text data format if the Partition by Fields property is enabled.
- Security option for Kafka stages - Kafka stages support the Custom Authentication (Security Protocol=CUSTOM) security option. Use the option to specify custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
- Snowflake destination - When merging data, the destination supports multiple join keys.
- Clusters
-
- Amazon EMR clusters - You can set the Amazon EMR runtime role with the Execution Role property.
- Google Dataproc clusters
- In pipelines running on Dataproc 2.0.40 clusters, you can include the Amazon Redshift destination and the Avro data format in Amazon S3 stages and Google Cloud Storage stages.
- You can access details about the Dataproc job run for a
Transformer pipeline in one of the following ways:
- After the corresponding Control Hub job completes, view the job run summary from the job History tab, which displays the Dataproc Job URL. Use the URL to access the Dataproc job in the Google Cloud Console.
- Log in to the Google Cloud Console and view the list of jobs run on the Dataproc cluster. Filter the jobs by the streamsets-transformer-pipeline-id or streamsets-transformer-pipeline-name label. These labels are applied to all Dataproc jobs run for Transformer pipelines.
- Additional enhancements
-
- Expression language - Batch functions retrieve the value of an attribute in a batch header. You can use the functions in specific properties of destination stages.
- Credential stores
- Google Secret Manager - You can use Google Secret Manager as a credential store for Transformer.
- Property for migration - You can use a new
credentialStores.usePortableGroups
credential stores property to migrate pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.
Upgrade Impact
- Review use of generated XML files
- In Transformer 5.2.0 prebuilt with Scala 2.12, the XML files that
destinations write include the following initial XML
declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
If you use Transformer prebuilt with Scala 2.12, then after you upgrade to 5.2.0, review how pipelines use the generated files and make necessary changes to account for the new initial declaration.
- Update SDK code for database-vendor-specific JDBC origins
- In Transformer 5.2.0, four origins – MySQL JDBC Table, Oracle JDBC Table,
PostgreSQL JDBC Table, and SQL Server JDBC Table – replace the Table
property that accepted a single table with the Tables property that accepts
a list of multiple tables.
After you upgrade to 5.2.0, review your SDK for Python code for these origins and replace origin.table with origin.tables.
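To help locate SDK for Python scripts that still use the old property, a small scan such as the following sketch can list candidate lines. The function name is hypothetical. The pattern flags any `.table` attribute access, so it can also report objects that are not JDBC origins, and each hit still needs manual review.

```python
import re

# Matches ".table" used as an attribute name, but not ".tables" or ".table_name".
OLD_PROPERTY = re.compile(r"\.table\b")


def find_old_table_usage(source_text):
    """Return (line_number, line) for each line that still references '.table'."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if OLD_PROPERTY.search(line):
            hits.append((number, line.strip()))
    return hits


# Illustrative script content; only the first line uses the old property.
sample = "origin.table = 'orders'\nprint(origin.tables)\nprint(origin.table_name)\n"

for number, line in find_old_table_usage(sample):
    print(f"line {number}: {line}")
```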
5.2.0 Fixed Issues
- File-based stages in Transformer do not work correctly with Avro files in Spark 2.4.7 clusters.
- Some Databricks cluster configuration parameters for provisioning are ignored.
- Pipelines running on a Dataproc cluster fail, unable to access or create a directory specified in a resource file.
5.2.x Known Issues
- Transformer cannot locate a separate runtime properties file that has been uploaded as an
external resource for the engine.
Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.
- The Snowflake destination fails when attempting to create a new table and the
destination is configured as follows:
- Column Mapping Mode property is set to "By Name"
- Write Mode property is set to "Append rows to existing table or create table if none exists"
The pipeline produces the following error:
Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized
Workaround: Create the table before running the pipeline.
- In a PySpark processor, you cannot create views from inputs and then run queries on those views. For example, you cannot use the following code:
inputs[0].createOrReplaceTempView('test999')
output = spark.sql('select * from test999')
Workaround: Recreate the DataFrame in the Python Spark session. For example:
df = spark.createDataFrame(inputs[0].rdd)
df.createOrReplaceTempView("test999")
output = spark.sql('select * from test999')
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google
Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
5.1.x Release Notes
The Transformer 5.1.0 release occurred on July 26, 2022.
New Features and Enhancements
- Cluster support
-
- Databricks 10.4 - You can run pipelines on Databricks 10.4 clusters.
- Dataproc 2.0.40 - You can run pipelines on Dataproc 2.0.40 clusters. However, with this version pipelines cannot include the Amazon Redshift destination or the Avro data format in Amazon S3 stages or Google Cloud Storage stages.
- Database table origins
- This release includes the following enhancements for the MySQL JDBC Table,
Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table
origins:
- Maximum number of partitions - The Number of Partitions property in
previous releases has become the Maximum Number of Partitions
property. When the pipeline runs, the origin creates up to the
specified number of partitions. This allows the pipeline to run if
the origin cannot create the specified number of partitions.
In previous versions, if an origin could not create the specified number of partitions, the pipeline failed.
- Automatic partition selection - When you configure an origin to skip
offset tracking, the origin attempts to select a logical partition
column for the read. For more information, see “Partition Column
Selection” in the documentation for the origin.
In previous versions, the pipeline failed when no offset column could be found to use for partitioning.
- Runtime parameters
- You can use runtime parameters to represent a stage or pipeline property that displays as a list of configurations. For example, you can use a runtime parameter to define the Additional JDBC Configuration Properties for the JDBC Table origins.
5.1.x Known Issues
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google
Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- Kafka stages include a
Custom Authentication (Security Protocol=CUSTOM)
option for the Security Option property and related custom security properties that are not yet supported. Do not use the custom security option or specify custom security properties. When defined, custom security properties are ignored.
5.0.x Release Notes
The Transformer 5.0.0 release occurred on May 30, 2022.
New Features and Enhancements
- Stage enhancements
-
- Snowflake stages property rename and enhancement - The
Additional Snowflake Configuration Properties property is now named
Connection Properties and is moved from the Advanced tab to the
Connection tab. In addition, you can specify credential functions in the
property value to retrieve secrets stored in a credential store. This
change affects the following stages:
- Snowflake origin
- Snowflake Lookup processor
- Snowflake destination
- Transformer logs
- Transformer uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Transformer used the Apache Log4j 1.x library, which is now end-of-life.
- Proxy server configuration
- To configure Transformer to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.
5.0.0 Fixed Issues
- You cannot preview or validate a pipeline using embedded Spark libraries.
- Transformer 4.0.0 or later cannot load runtime resources for a pipeline running on a Hadoop YARN Cloudera distribution cluster.
5.0.x Known Issues
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc
pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- Due to memory issues in older Databricks clusters, communication failures can occur
when running pipelines on those clusters. The memory issues can generate error
messages such as the
following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine-tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
- Increase the memory on Spark cluster nodes.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- A successful Transformer
pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed
status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the
pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
-
Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
4.3.x Release Notes
The Transformer 4.3.0 release occurred on April 29, 2022.
New Features and Enhancements
- Cluster support
-
- Cloudera Data Engineering cluster support - Transformer now supports running pipelines on Cloudera Data Engineering virtual clusters.
- Cloudera CDP Private Cloud Base 7.1.x support - Transformer now supports Hadoop YARN clusters on Cloudera CDP Private Cloud Base 7.1.x.
- EMR connection retry properties - You can configure the
following new properties to define how Transformer retries a failed request or throttling error for an EMR cluster:
- Max Retries
- Retry Base Delay
- Throttling Retry Base Delay
- Max Backoff
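These release notes do not describe how Transformer combines the retry properties. As a general illustration only, retry settings of this kind are commonly applied as capped exponential backoff, sketched below. The function name, the doubling factor, and the millisecond units are assumptions, and real implementations usually add random jitter.

```python
def backoff_delays(max_retries, base_delay_ms, max_backoff_ms):
    """Return the wait before each retry: the base delay doubled per attempt, capped."""
    return [
        min(base_delay_ms * 2 ** attempt, max_backoff_ms)
        for attempt in range(max_retries)
    ]


# Ordinary failures start from a short base delay.
print(backoff_delays(max_retries=5, base_delay_ms=100, max_backoff_ms=1000))

# Throttling errors typically start from a longer base delay and reach the cap sooner.
print(backoff_delays(max_retries=3, base_delay_ms=500, max_backoff_ms=1000))
```

In this pattern, a separate and larger base delay for throttling errors makes the client back off faster when the service signals that it is overloaded.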
- New stage
-
- New JSON Parser processor - Use the JSON Parser processor to parse a JSON object embedded in a string field.
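The idea of parsing a JSON object embedded in a string field can be illustrated with plain Python as follows. This sketch is not the processor's implementation, and the record layout, field names, and function name are assumptions.

```python
import json


def parse_json_field(record, field, target):
    """Return a copy of the record with the JSON string in `field` parsed into `target`."""
    parsed = dict(record)
    parsed[target] = json.loads(record[field])
    return parsed


# Hypothetical record with a JSON object stored as a string.
record = {"id": 1, "payload": '{"sku": "A1", "qty": 2}'}
print(parse_json_field(record, "payload", "order"))
```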
- Stage enhancements
-
- Amazon S3 origin property rename - The Bucket property is now named Bucket and Path. It has always allowed entering a path that includes the asterisk (*) and question mark (?) wildcards.
- New empty dataframe behavior for JDBC Table origins - When there is
no data to be read, the following origins now pass the table schema
in an empty dataframe:
- JDBC Table origin
- MySQL JDBC Table origin
- Oracle JDBC Table origin
- PostgreSQL JDBC Table origin
In previous releases, these origins passed an empty schema with empty dataframes. This change has no upgrade impact because it includes a new Use Empty Schemas property that passes an empty schema with empty dataframes.
To preserve backward compatibility, the Use Empty Schemas property is enabled for all upgraded pipelines. For new pipelines, this property is disabled by default.
- Partition Base Path origin property - The following origins now
allow specifying a base path for partitions in a Partition Base Path
property:
- ADLS Gen1 origin
- ADLS Gen2 origin
- Amazon S3 origin
- File origin
- Google Cloud Storage origin
- MapR FS origin
- Skip Empty Batches destination property - The following destinations
can now skip writing empty batches when you select the Skip Empty
Batches property:
- ADLS Gen1 destination
- ADLS Gen2 destination
- Amazon S3 destination
- File destination
- Google Cloud Storage destination
- MapR FS destination
4.3.0 Fixed Issues
- When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
4.3.x Known Issues
- You cannot preview or validate a pipeline using embedded Spark libraries.
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google
Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security
provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- Due to memory issues in older Databricks clusters, communication failures can
occur when running pipelines on those clusters. The memory issues can generate
error messages such as the
following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine-tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
- Increase the memory on Spark cluster nodes.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the
pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
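For the Databricks memory workaround listed above, the properties to tune can be collected in one place before you apply them as extra Spark configuration. The following is a minimal sketch, not Transformer code: the function names and the sample values are assumptions, and suitable values depend on the size of your cluster nodes.

```python
# Hypothetical helpers: the property names come from the workaround,
# the values passed in are placeholders chosen for illustration only.


def memory_tuning_properties(driver_memory, driver_cores, executor_memory, executor_cores):
    """Return the memory-related Spark properties as a name -> value mapping."""
    return {
        "spark.driver.memory": driver_memory,
        "spark.driver.cores": str(driver_cores),
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
    }


def render_properties(properties):
    """Render the properties as sorted name=value lines for review."""
    return "\n".join(f"{name}={value}" for name, value in sorted(properties.items()))


if __name__ == "__main__":
    # Placeholder sizes; adjust them to the memory available on your nodes.
    print(render_properties(memory_tuning_properties("8g", 2, "8g", 4)))
```

Keeping the four properties together makes it easier to change them as a set when you test different sizes.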
4.2.x Release Notes
The Transformer 4.2.0 release occurred on January 21, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming Control Hub feature on the StreamSets platform.
- Clusters
- Amazon EMR:
- You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
- You can configure pipelines on an EMR cluster to assume another role.
- Databricks:
- You can run pipelines on Databricks 9.1 LTS clusters.
- You can access the Databricks job details when you view a job run summary from the job History tab.
- Google Dataproc 2.0 - You can run pipelines on Google Dataproc 2.0.
- Stage enhancements
- Amazon S3 destination - When configuring the destination, you can now use the s3 URI scheme, in addition to the s3a scheme. Best practice is to use s3 with EMR clusters and s3a with all other clusters.
- Field Replacer processor - Use Spark SQL expressions to generate new values for specified fields. You can use quotation marks to specify a string.
- Google BigQuery origin - The origin can now read from Google BigQuery views.
- Connections
- Amazon EMR cluster connections include the following enhancements:
- You can configure Amazon EMR cluster connections to assume another role.
- You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
- Deprecation and testing update
- The Cloudera CDH 5.x stage libraries are now deprecated. As a result, StreamSets no longer tests Transformer against Cloudera CDH 5.x.
- Additional enhancements
- CyberArk credential store support - You can use CyberArk as a credential store for Transformer.
- Cluster URL access - When monitoring a Control Hub job for a Databricks pipeline, when you view the job summary, you can now access the Databricks cluster job URL.
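The Amazon S3 destination best practice noted in the stage enhancements above (s3 with EMR clusters, s3a with all other clusters) can be expressed as a small helper. This is a sketch only: the function name, the cluster-type strings, and the bucket and key values are hypothetical.

```python
def s3_uri(cluster_type, bucket, key):
    """Build an S3 URI, using s3 for EMR clusters and s3a for all other clusters."""
    # "emr" is an assumed label for the cluster type, compared case-insensitively.
    scheme = "s3" if cluster_type.strip().lower() == "emr" else "s3a"
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


if __name__ == "__main__":
    print(s3_uri("EMR", "example-bucket", "output/orders"))
    print(s3_uri("Databricks", "example-bucket", "output/orders"))
```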
4.2.0 Fixed Issues
- Redshift destinations fail to write partitioned data when running on Databricks cluster versions 7.x and later. The pipeline fails with the following error:
java.sql.SQLException: Invalid operation: Mandatory url is not present in manifest file.
- The Scala processor always checks if a batch is empty instead of checking only when the Skip Empty Batches property is enabled. This slows performance.
- Runtime resources are not accessible from Transformer pipelines.
- When provisioning a Databricks cluster, the user-defined tags defined in cluster configuration properties are not being set.
- When provisioning a Databricks cluster, the policy_id parameter defined in the cluster configuration properties is ignored.
- For pipelines run on Databricks clusters, resources are staged in the EBS volumes instead of the Databricks distributed file system (DBFS), and the resources are not being removed when no longer needed.
4.2.x Known Issues
- As noted in the StreamSets Technical Service Bulletin, Transformer 3.12.0 and later are not vulnerable to the Apache Log4j zero-day vulnerability documented in CVE-2021-44228.
However, StreamSets highly recommends that you update all clusters that run Transformer pipelines to protect against the zero-day vulnerability.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
- Increase the memory on Spark cluster nodes.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
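For the Conscrypt workaround listed above, the new-cluster option amounts to passing one cluster property at creation time. The sketch below assumes that you create the cluster yourself with the gcloud CLI and its --properties flag, which these release notes do not cover. The cluster name and region are placeholders, and the function only builds the command without running it.

```python
import shlex

# Property named in the workaround; setting it to false disables Conscrypt.
CONSCRYPT_PROPERTY = "dataproc:dataproc.conscrypt.provider.enable"


def create_cluster_command(cluster_name, region):
    """Build (but do not run) a gcloud command that creates a cluster without Conscrypt."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--properties={CONSCRYPT_PROPERTY}=false",
    ]


if __name__ == "__main__":
    # Hypothetical cluster name and region, for illustration only.
    print(shlex.join(create_cluster_command("example-cluster", "us-central1")))
```

Building the argument list separately lets you review the exact command before you run it in your own environment.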
4.1.x Release Notes
The Transformer 4.1.0 release occurred on September 27, 2021.
New Features and Enhancements
- Clusters
- Amazon EMR:
- AWS tags - When provisioning an Amazon EMR cluster, you can specify AWS tags for the cluster.
- Regions - You can now specify additional regions for EMR clusters.
- Google Dataproc:
- Dataproc labels - When provisioning a Google Dataproc cluster, you can specify Dataproc labels for the cluster.
- Credentials files - You can now specify relative paths in addition to absolute paths to service account credentials files.
- Regions - You can now specify additional regions for Dataproc clusters.
- Databricks:
- Job submission - When you start a pipeline, Transformer now submits the pipeline to a Databricks cluster directly as a workload, creating an ephemeral job.
Previously, Transformer created one-time jobs, which counted against the job limit on the account. Ephemeral jobs do not count towards the job limit.
Note: The details of ephemeral jobs do not display with regular jobs through the Databricks job menu. For details, see Upgrade Impact.
- Init script enhancement - When provisioning a Databricks cluster on Azure, you can now use Azure cluster-scoped init scripts stored on Azure Blob File System that are accessible using an ADLS Gen2 storage account.
- Stages
- New JDBC Query origin - Use the JDBC Query origin to read data from database tables with a custom query.
- JDBC origin renamed - To clarify the difference between this existing origin and the new JDBC Query origin, the JDBC origin is now known as the JDBC Table origin.
- Credential stores
- HashiCorp Vault credential store - You can use HashiCorp Vault as a credential store for Transformer.
- Additional enhancements
- Job functions - You can now use job functions when you configure any pipeline property that allows expressions.
- Enabling HTTPS for Transformer - You can now store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore and truststore location. This can have upgrade impact.
Upgrade Impact
- Java JDK 11 enforcement for Scala 2.12 installations
- With this release, when Transformer is prebuilt with Scala 2.12, it requires a Java JDK 11 installation. In previous releases, Transformer prebuilt with Scala 2.12 also required a Java JDK 11 installation, but the requirement was not enforced.
- Databricks job submission change
- With this release, Transformer submits jobs to Databricks differently from previous releases.
- HDInsight pipelines with ADLS stages
- With this release, when you include an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for HDInsight cluster, the stage must use the ADLS cluster-provided libraries stage library.
- Enabling HTTPS for Transformer
- With this release, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
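To illustrate the HTTPS change described above, the sketch below shows how a relative keystore or truststore location maps onto the resources directory. It is an illustration rather than Transformer's own resolution logic: the function names, the installation directory, and the file name are hypothetical, and leaving absolute paths unchanged is an assumption.

```python
from pathlib import PurePosixPath


def resources_dir(installation_dir):
    """Return <installation_dir>/externalResources/resources."""
    return PurePosixPath(installation_dir) / "externalResources" / "resources"


def resolve_store_path(installation_dir, configured_path):
    """Resolve a keystore or truststore location.

    Relative paths are resolved against the resources directory.
    Absolute paths are returned unchanged (an assumption of this sketch).
    """
    path = PurePosixPath(configured_path)
    if path.is_absolute():
        return path
    return resources_dir(installation_dir) / path


if __name__ == "__main__":
    # Hypothetical installation directory and keystore file name.
    print(resolve_store_path("/opt/transformer", "keystore.jks"))
```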
4.1.0 Fixed Issues
- Pipelines with ADLS stages that run on Azure HDInsight 4.0 clusters with Transformer built for Spark 2.4 fail to start. This fix might cause upgrade impact.
- When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, the job can hang in a failover Transformer in a STARTING state when the Spark job completes before the failover Transformer fully takes over the Control Hub job.
- Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
- A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
4.1.x Known Issues
- When provisioning a Databricks cluster, user-defined tags defined in cluster configuration properties are not being set.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
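For the Google BigQuery destination issue listed above, one way to spot affected data before a run is to check the scale of sample values. The following is a hedged sketch, not part of Transformer: the function names and sample values are hypothetical, and it only handles finite Decimal values.

```python
from decimal import Decimal

# From the known issue: values with a scale greater than 9 fail to write.
MAX_SAFE_SCALE = 9


def scale_of(value):
    """Return the number of digits after the decimal point of a finite Decimal."""
    exponent = value.as_tuple().exponent
    return -exponent if exponent < 0 else 0


def values_exceeding_scale(values, max_scale=MAX_SAFE_SCALE):
    """Return the values whose scale is greater than max_scale."""
    return [value for value in values if scale_of(value) > max_scale]


if __name__ == "__main__":
    sample = [Decimal("1.5"), Decimal("0.1234567891"), Decimal("2.123456789")]
    print(values_exceeding_scale(sample))
```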
4.0.x Release Notes
The Transformer 4.0.0 release occurred on June 21, 2021.
New Features and Enhancements
- Spark 3 and Scala 2.12 support
- Transformer supports using Spark 3.0 and Scala 2.12 for some cluster types. As a result, StreamSets now provides different installation packages for Transformer.
For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.
- Stages
- New Amazon Redshift origin - Use the Amazon Redshift origin to read data from an Amazon Redshift table.
- Clusters
- Amazon EMR enhancements:
- Additional EMR support - You can run pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Bootstrap actions support - When you provision a cluster, you can define bootstrap actions scripts in cluster configuration properties or you can use bootstrap actions scripts stored on Amazon S3.
- Databricks clusters:
- Additional Databricks support - You can run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Cluster-scoped init script support - When you provision a cluster, you can define cluster-scoped init scripts in cluster configuration properties. You can also use cluster-scoped init scripts stored on DBFS or S3. Specifying a location on Azure is not available at this time.
- Databricks failover support - You can configure pipeline failover for Databricks pipelines.
- Application Name enhancement - When specifying an application name for a cluster, you can now use underscores in addition to alphanumeric characters.
- Connections
- With this release, the following stages support using connections:
- Additional enhancements
- TRANSFORMER_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as external libraries and runtime resources.
The default location is $TRANSFORMER_DIST/externalResources.
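The default described above for the TRANSFORMER_EXTERNAL_RESOURCES environment variable can be mirrored in a short lookup. This is a sketch of the stated default only, not Transformer's internal logic: the function name is hypothetical, and it assumes that TRANSFORMER_DIST is set whenever the optional variable is not.

```python
import os


def external_resources_dir(environ=None):
    """Return the external resources root directory.

    Uses TRANSFORMER_EXTERNAL_RESOURCES when it is set and non-empty,
    otherwise falls back to $TRANSFORMER_DIST/externalResources.
    Raises KeyError if neither variable is available.
    """
    env = os.environ if environ is None else environ
    configured = env.get("TRANSFORMER_EXTERNAL_RESOURCES")
    if configured:
        return configured
    return env["TRANSFORMER_DIST"] + "/externalResources"


if __name__ == "__main__":
    # Hypothetical installation path, passed explicitly for illustration.
    print(external_resources_dir({"TRANSFORMER_DIST": "/opt/transformer"}))
```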
4.0.0 Fixed Issues
- When you force stop an EMR pipeline, the Spark job on EMR continues to run until the last batch is written.
With this fix, when you force stop an EMR pipeline, Transformer first tries to stop the Spark job through the YARN service in the cluster. If the YARN service is not reachable, Transformer sends a new step to the EMR cluster with the stop command.
As a result, if the YARN service is not reachable, Transformer can only force stop the pipeline when all of the following are true:
- The pipeline runs on EMR 5.28 or later with support for step concurrency.
- The Step Concurrency property in the pipeline is set to 2 or higher.
- A step becomes available.
- When a Databricks pipeline successfully completes, Transformer indicates that it has finished running. However, on the Databricks cluster, the Spark job seems to be cancelled instead.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on EMR clusters.
- Upgrading pipelines with the Amazon S3 destination created on Transformer 3.15.0 or earlier to Transformer 3.16.x - 3.18.x can generate errors related to the Partition by Fields stage property.
- Errors occur when using the Amazon S3 origin and destination in the same pipeline when reading from and writing to different regions.
- The Field Renamer processor does not rename fields for empty batches.
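The three conditions given above for force stopping an EMR pipeline when the YARN service is not reachable can be written as a single check. The sketch below only restates those conditions. The function names are hypothetical, and it assumes a plain version string such as 5.28.0.

```python
def parse_emr_version(version):
    """Parse a 'major.minor[.patch]' EMR version string into (major, minor)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def can_force_stop_without_yarn(emr_version, step_concurrency, step_available):
    """Return True when all three conditions for the step-based stop are met."""
    return (
        parse_emr_version(emr_version) >= (5, 28)  # EMR 5.28 or later
        and step_concurrency >= 2                  # Step Concurrency set to 2 or higher
        and step_available                         # a step becomes available
    )


if __name__ == "__main__":
    print(can_force_stop_without_yarn("5.28.0", 2, True))
    print(can_force_stop_without_yarn("5.27.0", 2, True))
```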
4.0.x Known Issues
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, and the Spark job completes before a failover Transformer fully takes over the Control Hub job, the Control Hub job can hang in the failover Transformer in a STARTING state with the following error:
CONTAINER_0102 - Cannot change state from STARTING to FINISHING
Workaround: To correctly finish the Control Hub job, use Control Hub to force stop the job and wait until the job reaches an INACTIVE_ERROR state. Then, acknowledge the error.
- Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.