Release Notes

5.2.x Release Notes

The Transformer 5.2.0 release occurred on October 27, 2022.

New Features and Enhancements

New stage
XML Parser processor - You can use the new XML Parser processor to parse an XML document in a string field and pass the parsed data to a map field.
Database table origins
The MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table origins include the following enhancements:
  • Support for multiple tables - These origins can read from multiple tables. The Table property in previous releases has become the Tables property. When one of these origins is configured to read from multiple tables, the pipeline processes multiple batches in both batch and streaming mode.
  • Batch headers - Pipelines with these origins generate a batch header for each batch, and the origin writes the jdbc.table attribute in the header. The attribute stores the name of the table that the origin reads for the batch.
Other stage enhancements
  • Data Formats property - The File, ADLS Gen1, ADLS Gen2, Amazon S3, and Google Cloud Storage stages have an Additional Data Format Configuration property on the Data Formats tab. You can use this property to enter other data format parameters.
  • File-based destinations - File-based destinations that write in the Text data format generate a validation error if the Partition by Fields property is enabled.
  • Security option for Kafka stages - Kafka stages support the Custom Authentication (Security Protocol=CUSTOM) security option. Use the option to specify custom properties that contain the information required by a security protocol, rather than using predefined properties associated with other security options.
  • Snowflake destination - When merging data, the destination supports multiple join keys.
Clusters
  • Amazon EMR clusters - You can set the Amazon EMR runtime role with the Execution Role property.
  • Google Dataproc clusters
    • In pipelines running on Dataproc 2.0.40 clusters, you can include the Amazon Redshift destination and the Avro data format in Amazon S3 stages and Google Cloud Storage stages.
    • You can access details about the Dataproc job run for a Transformer pipeline in one of the following ways:
      • After the corresponding Control Hub job completes, view the job run summary on the job History tab, which displays the Dataproc Job URL. Use the URL to access the Dataproc job in the Google Cloud Console.
      • Log in to the Google Cloud Console and view the list of jobs run on the Dataproc cluster. Filter the jobs by the streamsets-transformer-pipeline-id or streamsets-transformer-pipeline-name label. Both labels are applied to all Dataproc jobs run for Transformer pipelines.
Additional enhancements
  • Expression language - Batch functions retrieve the value of an attribute in a batch header. You can use the functions in specific properties of destination stages.
  • Credential stores
    • Google Secret Manager - You can use Google Secret Manager as a credential store for Transformer.
    • Property for migration - You can use a new credentialStores.usePortableGroups credential stores property to migrate pipelines that access credential stores from one Control Hub organization to another. Contact StreamSets Support before enabling this option.

Upgrade Impact

Review use of generated XML files
In Transformer 5.2.0 prebuilt with Scala 2.12, the XML files that destinations write include the following initial XML declaration:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

If you use Transformer prebuilt with Scala 2.12, then after you upgrade to 5.2.0, review how pipelines use the generated files and make necessary changes to account for the new initial declaration.
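
One way to review a downstream consumer is to confirm that it parses the generated files instead of matching on the first line of text. The following is a minimal sketch using only the Python standard library. The sample file bodies and the helper names root_tag and starts_with_declaration are illustrative assumptions, not Transformer output or Transformer APIs.

```python
import xml.etree.ElementTree as ET

# Declaration quoted in the upgrade note above.
DECLARATION = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'


def root_tag(xml_text: str) -> str:
    """Return the root element name, whether or not the declaration is present."""
    # Encode to bytes so the parser applies the encoding named in the declaration.
    return ET.fromstring(xml_text.encode("utf-8")).tag


def starts_with_declaration(xml_text: str) -> bool:
    """Report whether a file body begins with the declaration."""
    return xml_text.lstrip().startswith(DECLARATION)


# Hypothetical file bodies: one without the declaration, one with it.
without_declaration = "<record><id>1</id></record>"
with_declaration = DECLARATION + "\n" + without_declaration

print(root_tag(without_declaration), root_tag(with_declaration))
print(starts_with_declaration(without_declaration),
      starts_with_declaration(with_declaration))
```

A consumer that uses an XML parser, as root_tag does, reads both forms the same way. A consumer that compares raw text, such as a check on the first line, is the kind of usage that likely needs a change.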

Update SDK code for database-vendor-specific JDBC origins
In Transformer 5.2.0, four origins – MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table – replace the Table property that accepted a single table with the Tables property that accepts a list of multiple tables.

After you upgrade to 5.2.0, review your SDK for Python code for these origins and replace origin.table with origin.tables.
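
To locate code that still uses the old property, a simple text scan can help. The sketch below does not call the SDK for Python. It only searches script text for the .table attribute, and the sample script, including the value assigned to origin.tables, is a hypothetical illustration.

```python
import re

# Matches attribute access ".table" but not ".tables" or ".table_name".
OLD_PROPERTY = re.compile(r"\.table\b")


def find_old_table_usage(source: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs that still reference the old property."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if OLD_PROPERTY.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script fragment; the assigned values are assumptions.
sample = """
origin.table = 'orders'
origin.tables = ['orders', 'customers']
"""

for number, line in find_old_table_usage(sample):
    print(f"line {number}: {line}")
```

The scan is deliberately broad: it reports any .table attribute access, so review each hit to confirm that it belongs to one of the four affected origins before changing it.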

5.2.0 Fixed Issues

  • File-based stages in Transformer do not work correctly with Avro files in Spark 2.4.7 clusters.
  • Some Databricks cluster configuration parameters for provisioning are ignored.
  • Pipelines running on a Dataproc cluster fail, unable to access or create a directory specified in a resource file.

5.2.x Known Issues

  • Transformer cannot locate a separate runtime properties file that has been uploaded as an external resource for the engine.

    Workaround: Define runtime properties in the Transformer configuration properties instead of in a separate runtime properties file.

  • The Snowflake destination fails when attempting to create a new table and the destination is configured as follows:
    • Column Mapping Mode property is set to "By Name"
    • Write Mode property is set to "Append rows to existing table or create table if none exists"
    The pipeline produces the following error:
    Caused by: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
    Object 'X_DB.X_SCHEMA.X_TABLE' does not exist or not authorized

    Workaround: Create the table before running the pipeline.

  • In a PySpark processor, you cannot create views from inputs and then run queries on those views.
    For example, you cannot use the following code:
    inputs[0].createOrReplaceTempView('test999')
    output = spark.sql('select * from test999')

    Workaround: Recreate the DataFrame in the Python Spark session.

    For example:
    # Rebuild the DataFrame in the Python Spark session, then register the view on it.
    df = spark.createDataFrame(inputs[0].rdd)
    df.createOrReplaceTempView('test999')
    output = spark.sql('select * from test999')
  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
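
For the Conscrypt workaround listed above, the property for new clusters can be supplied when the cluster is created. The following is a minimal sketch that builds a gcloud argument list with that property, assuming the cluster is created with the gcloud CLI instead of being provisioned by Transformer. The cluster name and region are hypothetical.

```python
import shlex

# Property named in the Conscrypt workaround.
CONSCRYPT_PROPERTY = "dataproc:dataproc.conscrypt.provider.enable=false"


def build_create_command(cluster_name: str, region: str) -> list[str]:
    """Build a gcloud argument list that creates a cluster with Conscrypt disabled."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--properties", CONSCRYPT_PROPERTY,
    ]


# Hypothetical cluster name and region, for illustration only.
command = build_create_command("example-cluster", "us-central1")
print(shlex.join(command))
```

The sketch only prints the command. Passing the list to subprocess.run would execute it, which requires an installed and authenticated gcloud CLI.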

5.1.x Release Notes

The Transformer 5.1.0 release occurred on July 26, 2022.

New Features and Enhancements

Cluster support
  • Databricks 10.4 - You can run pipelines on Databricks 10.4 clusters.
  • Dataproc 2.0.40 - You can run pipelines on Dataproc 2.0.40 clusters. However, with this version pipelines cannot include the Amazon Redshift destination or the Avro data format in Amazon S3 stages or Google Cloud Storage stages.
Database table origins
This release includes the following enhancements for the MySQL JDBC Table, Oracle JDBC Table, PostgreSQL JDBC Table, and SQL Server JDBC Table origins:
  • Maximum number of partitions - The Number of Partitions property in previous releases has become the Maximum Number of Partitions property. When the pipeline runs, the origin creates up to the specified number of partitions. This allows the pipeline to run if the origin cannot create the specified number of partitions.

    In previous versions, the pipeline failed if an origin could not create the specified number of partitions.

  • Automatic partition selection - When you configure an origin to skip offset tracking, the origin attempts to select a logical partition column for the read. For more information, see “Partition Column Selection” in the documentation for the origin.

    In previous versions, the pipeline failed when the origin could not find an offset column to use for partitioning.

Runtime parameters
You can use runtime parameters to represent a stage or pipeline property that displays as a list of configurations. For example, you can use a runtime parameter to define the Additional JDBC Configuration Properties for the JDBC Table origins.

5.1.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • Kafka stages include a Custom Authentication (Security Protocol=CUSTOM) option for the Security Option property and related custom security properties that are not yet supported. Do not use the custom security option or specify custom security properties. When defined, custom security properties are ignored.

5.0.x Release Notes

The Transformer 5.0.0 release occurred on May 30, 2022.

New Features and Enhancements

Stage enhancements
  • Snowflake stages property rename and enhancement - The Additional Snowflake Configuration Properties property is now named Connection Properties and is moved from the Advanced tab to the Connection tab. In addition, you can specify credential functions in the property value to retrieve secrets stored in a credential store. This change affects the following stages:
    • Snowflake origin
    • Snowflake Lookup processor
    • Snowflake destination
Transformer logs
Transformer uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Transformer used the Apache Log4j 1.x library, which is now end-of-life.
Proxy server configuration
To configure Transformer to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.
Previously, you configured Transformer to use a proxy server by defining Java configuration options for the deployment and then setting the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable on the Transformer machine.

5.0.0 Fixed Issues

  • You cannot preview or validate a pipeline using embedded Spark libraries.
  • Transformer 4.0.0 or later cannot load runtime resources for a pipeline running on a Hadoop YARN Cloudera distribution cluster.

5.0.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.
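
For the Databricks memory workaround listed above, the four Spark properties can be kept together and reviewed as one set. The sketch below is illustrative only: the values are placeholders and not recommendations, and the `--conf key=value` rendering is the spark-submit form, shown as one assumed way to express the settings.

```python
# Illustrative values only; appropriate sizes depend on the cluster and workload.
memory_settings = {
    "spark.driver.memory": "8g",
    "spark.driver.cores": "2",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}


def as_conf_args(settings: dict[str, str]) -> list[str]:
    """Render settings as spark-submit style `--conf key=value` arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(as_conf_args(memory_settings)))
```

The same key and value pairs apply wherever the Spark configuration for the pipeline is defined.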

4.3.x Release Notes

The Transformer 4.3.0 release occurred on April 29, 2022.

New Features and Enhancements

Cluster support
New stage
Stage enhancements
  • Amazon S3 origin property rename - The Bucket property is now named Bucket and Path. It has always allowed entering a path that includes the asterisk (*) and question mark (?) wildcards.
  • New empty dataframe behavior for JDBC Table origins - When there is no data to be read, the following origins now pass the table schema in an empty dataframe:
    • JDBC Table origin
    • MySQL JDBC Table origin
    • Oracle JDBC Table origin
    • PostgreSQL JDBC Table origin

    In previous releases, these origins passed an empty schema with empty dataframes. This change has no upgrade impact because the release includes a new Use Empty Schemas property that, when enabled, passes an empty schema with empty dataframes.

    To preserve backward compatibility, the Use Empty Schemas property is enabled for all upgraded pipelines. For new pipelines, this property is disabled by default.

  • Partition Base Path origin property - The following origins now allow specifying a base path for partitions in a Partition Base Path property:
    • ADLS Gen1 origin
    • ADLS Gen2 origin
    • Amazon S3 origin
    • File origin
    • Google Cloud Storage origin
    • MapR FS origin
  • Skip Empty Batches destination property - The following destinations can now skip writing empty batches when you select the Skip Empty Batches property:
    • ADLS Gen1 destination
    • ADLS Gen2 destination
    • Amazon S3 destination
    • File destination
    • Google Cloud Storage destination
    • MapR FS destination

4.3.0 Fixed Issues

  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.

4.3.x Known Issues

  • You cannot preview or validate a pipeline using embedded Spark libraries.
  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.2.x Release Notes

The Transformer 4.2.0 release occurred on January 21, 2022.

New Features and Enhancements

Internal update
This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
Note: All new Transformer deployments on StreamSets DataOps Platform will use Transformer version 4.2.0 or higher. Existing deployments are not affected.
Clusters
Stage enhancements
  • Amazon S3 destination - When configuring the destination, you can now use the s3 URI scheme, in addition to the s3a scheme. Best practice is to use s3 with EMR clusters and s3a with all other clusters.
  • Field Replacer processor - Use Spark SQL expressions to generate new values for specified fields. You can use quotation marks to specify a string.
  • Google BigQuery origin - The origin can now read from Google BigQuery views.
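
The URI scheme guidance for the Amazon S3 destination above can be sketched as a small helper. The function name s3_uri and the cluster type labels are hypothetical and are not Transformer settings. The bucket and path are placeholders.

```python
def s3_uri(bucket: str, path: str, cluster_type: str) -> str:
    """Build a destination URI, using s3 on EMR clusters and s3a on all others."""
    # "emr" is a hypothetical label used only in this sketch.
    scheme = "s3" if cluster_type.lower() == "emr" else "s3a"
    return f"{scheme}://{bucket}/{path.lstrip('/')}"


print(s3_uri("example-bucket", "/output/daily", "emr"))
print(s3_uri("example-bucket", "output/daily", "databricks"))
```
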
Connections
  • Amazon EMR cluster connections include the following enhancements:
    • You can configure Amazon EMR cluster connections to assume another role.
    • You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
Deprecation and testing update
  • The Cloudera CDH 5.x stage libraries are now deprecated. As a result, StreamSets no longer tests Transformer against Cloudera CDH 5.x.
Additional enhancements
  • CyberArk credential store support - You can use CyberArk as a credential store for Transformer.
  • Cluster URL access - When you monitor a Control Hub job for a Databricks pipeline, you can now access the Databricks cluster job URL from the job summary.

4.2.0 Fixed Issues

  • Redshift destinations fail to write partitioned data when running on Databricks cluster versions 7.x and later. The pipeline fails with the following error:
    java.sql.SQLException: Invalid operation: Mandatory url is not present in manifest file.
  • The Scala processor always checks if a batch is empty instead of checking only when the Skip Empty Batches property is enabled. This slows performance.
  • Runtime resources are not accessible from Transformer pipelines.
  • When provisioning a Databricks cluster, the user-defined tags defined in cluster configuration properties are not being set.
  • When provisioning a Databricks cluster, the policy_id parameter defined in the cluster configuration properties is ignored.
  • For pipelines run on Databricks clusters, resources are staged in the EBS volumes instead of the Databricks distributed file system (DBFS), and the resources are not being removed when no longer needed.

4.2.x Known Issues

  • As noted in the StreamSets Technical Service Bulletin, Transformer 3.12.0 and later are not vulnerable to the Apache Log4j zero-day vulnerability documented in CVE-2021-44228.

    However, StreamSets highly recommends that you update all clusters that run Transformer pipelines to protect against the zero-day vulnerability.

  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
    Redshift pipelines fail with the following error:
    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.1.x Release Notes

The Transformer 4.1.0 release occurred on September 27, 2021.

New Features and Enhancements

Clusters
  • Amazon EMR:
    • AWS tags - When provisioning an Amazon EMR cluster, you can specify AWS tags for the cluster.
    • Regions - You can now specify additional regions for EMR clusters.
  • Google Dataproc:
    • Dataproc labels - When provisioning a Google Dataproc cluster, you can specify Dataproc labels for the cluster.
    • Credentials files - You can now specify relative paths in addition to absolute paths to service account credentials files.
    • Regions - You can now specify additional regions for Dataproc clusters.
  • Databricks:
    • Job submission - When you start a pipeline, Transformer now submits the pipeline to a Databricks cluster directly as a workload, creating an ephemeral job.
      Previously, Transformer created one-time jobs, which counted against the job limit on the account. Ephemeral jobs do not count towards the job limit.
      Note: The details of ephemeral jobs do not display with regular jobs through the Databricks job menu. For details, see Upgrade Impact.
    • Init script enhancement - When provisioning a Databricks cluster on Azure, you can now use Azure cluster-scoped init scripts stored on Azure Blob File System that are accessible using an ADLS Gen2 storage account.
Stages
  • New JDBC Query origin - Use the JDBC Query origin to read data from database tables with a custom query.
  • JDBC origin renamed - To clarify the difference between this existing origin and the new JDBC Query origin, the JDBC origin is now known as the JDBC Table origin.
Credential stores
Additional enhancements
  • Job functions - You can now use job functions when you configure any pipeline property that allows expressions.
  • Enabling HTTPS for Transformer - You can now store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore and truststore location. This can have upgrade impact.

Upgrade Impact

Java JDK 11 enforcement for Scala 2.12 installations
With this release, when Transformer is prebuilt with Scala 2.12, it requires a Java JDK 11 installation. In previous releases, though required by Transformer prebuilt with Scala 2.12, a Java JDK 11 installation was not enforced.
If you upgrade to Transformer 4.1.0 prebuilt with Scala 2.12, you must have Java JDK 11 installed on the Transformer machine for Transformer to start.
Databricks job submission change
With this release, Transformer submits jobs to Databricks differently from previous releases.
In previous releases, Transformer created a standard Databricks job for each pipeline run, but used it only once. This job counted toward the Databricks job limit.
With this release, Transformer submits ephemeral jobs to Databricks. An ephemeral job runs only once, and does not count towards the Databricks job limit. However, the details for the ephemeral jobs are only available for 60 days, and are not available through the Databricks jobs menu. For information about accessing job details, see Accessing Databricks Job Details.
HDInsight pipelines with ADLS stages
With this release, when you include an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for HDInsight cluster, the stage must use the ADLS cluster-provided libraries stage library.
Enabling HTTPS for Transformer
With this release, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
In previous releases, you could store the keystore and truststore files in the Transformer configuration directory, <installation_dir>/etc, and then define the location of each file using a path relative to that directory. You can continue to store the files in the configuration directory, but StreamSets recommends moving them to the resources directory when you upgrade.
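
As a rough illustration of the two locations described above, the following Python sketch builds the absolute location of a keystore file from a relative path. The function name store_location, the use_resources_dir flag, and the /opt/transformer installation directory are hypothetical and chosen only for this example; only the two directory names come from these notes.

```python
from pathlib import PurePosixPath

# Directory names taken from the release notes; everything else is illustrative.
RESOURCES_SUBDIR = "externalResources/resources"
CONFIG_SUBDIR = "etc"


def store_location(installation_dir: str, relative_path: str,
                   use_resources_dir: bool = True) -> str:
    """Join a relative keystore or truststore path to its base directory.

    Hypothetical helper: it only shows how a relative path maps to the
    resources directory or to the older configuration directory.
    """
    subdir = RESOURCES_SUBDIR if use_resources_dir else CONFIG_SUBDIR
    return str(PurePosixPath(installation_dir) / subdir / relative_path)


# Recommended location after upgrading (resources directory).
print(store_location("/opt/transformer", "keystore.jks"))
# Location used in previous releases (configuration directory).
print(store_location("/opt/transformer", "keystore.jks", use_resources_dir=False))
```

The same relative path, keystore.jks in this sketch, points to a different file depending on which base directory holds it, which is why moving the files is worth checking during an upgrade.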

4.1.0 Fixed Issues

  • Pipelines with ADLS stages that run on Azure HDInsight 4.0 clusters with Transformer built for Spark 2.4 fail to start. This fix might cause upgrade impact.
  • Transformer does not enforce the Java JDK 11 requirement for Transformer prebuilt with Scala 2.12. This fix might cause upgrade impact.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, the job can hang in a failover Transformer in a STARTING state when the Spark job completes before the failover Transformer fully takes over the Control Hub job.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.

4.1.x Known Issues

  • When provisioning a Databricks cluster, user-defined tags defined in cluster configuration properties are not being set.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.

4.0.x Release Notes

The Transformer 4.0.0 release occurred on June 21, 2021.

New Features and Enhancements

Spark 3 and Scala 2.12 support

Transformer supports using Spark 3.0 and Scala 2.12 for some cluster types. As a result, StreamSets now provides different installation packages for Transformer.

For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.

Stages
  • New Amazon Redshift origin - Use the Amazon Redshift origin to read data from an Amazon Redshift table.
Clusters
  • Amazon EMR enhancements:
    • Additional EMR support - You can run pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Bootstrap actions support - When you provision a cluster, you can define bootstrap actions scripts in cluster configuration properties or you can use bootstrap actions scripts stored on Amazon S3.
  • Databricks clusters:
    • Additional Databricks support - You can run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Cluster-scoped init script support - When you provision a cluster, you can define cluster-scoped init scripts in cluster configuration properties. You can also use cluster-scoped init scripts stored on DBFS or S3. Specifying a location on Azure is not available at this time.
    • Databricks failover support - You can configure pipeline failover for Databricks pipelines.
  • Application Name enhancement - When specifying an application name for a cluster, you can now use underscores in addition to alphanumeric characters.
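
The application name rule above can be sketched as a simple check. This is a minimal Python sketch that assumes only ASCII letters, digits, and underscores are accepted, which these notes imply but do not state outright. The function name is_valid_application_name is hypothetical.

```python
import re

# Assumption: only ASCII letters, digits, and underscores are accepted.
APP_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_application_name(name: str) -> bool:
    """Return True when the whole name uses only the assumed allowed characters."""
    return APP_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_application_name("nightly_orders_2021"))  # underscore is allowed
print(is_valid_application_name("nightly-orders"))       # hyphen is rejected here
```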
Connections

With this release, the following stages support using connections:

  • MySQL JDBC Table origin
  • Oracle JDBC Table origin
  • PostgreSQL JDBC Table origin
  • SQL Server JDBC Table origin
  • Amazon Redshift origin and destination - Available after the Data Collector 4.1.0 release.
Additional enhancements
  • TRANSFORMER_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as external libraries and runtime resources.

    The default location is $TRANSFORMER_DIST/externalResources.
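
A minimal Python sketch of how the default described above could be resolved, assuming the variable simply overrides the default when it is set. The function name external_resources_dir and the env parameter are hypothetical; only the two environment variable names and the default path come from these notes.

```python
import os
import posixpath


def external_resources_dir(env=None) -> str:
    """Return the external resources root directory.

    Assumption: TRANSFORMER_EXTERNAL_RESOURCES, when set and non-empty,
    takes precedence over the default $TRANSFORMER_DIST/externalResources.
    """
    env = os.environ if env is None else env
    explicit = env.get("TRANSFORMER_EXTERNAL_RESOURCES")
    if explicit:
        return explicit
    # Fall back to the documented default location.
    return posixpath.join(env.get("TRANSFORMER_DIST", ""), "externalResources")


# Variable not set: the default under TRANSFORMER_DIST is used.
print(external_resources_dir({"TRANSFORMER_DIST": "/opt/transformer"}))
# Variable set: the explicit directory is used.
print(external_resources_dir({
    "TRANSFORMER_DIST": "/opt/transformer",
    "TRANSFORMER_EXTERNAL_RESOURCES": "/data/transformer-resources",
}))
```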

4.0.0 Fixed Issues

  • When you force stop an EMR pipeline, the Spark job on EMR continues to run until the last batch is written.

    With this fix, when you force stop an EMR pipeline, Transformer first tries to stop the Spark job through the YARN service in the cluster. If the YARN service is not reachable, Transformer sends a new step to the EMR cluster with the stop command.

    As a result, if the YARN service is not reachable, Transformer can only force stop the pipeline when all of the following are true:
    • The pipeline runs on EMR 5.28 or later with support for step concurrency.
    • The Step Concurrency property in the pipeline is set to 2 or higher.
    • A step becomes available.
  • When a Databricks pipeline successfully completes, Transformer indicates that it has finished running. However, on the Databricks cluster, the Spark job seems to be cancelled instead.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on EMR clusters.
  • Upgrading pipelines with the Amazon S3 destination created on Transformer 3.15.0 or earlier to Transformer 3.16.x - 3.18.x can generate errors related to the Partition by Fields stage property.

  • Errors occur when using the Amazon S3 origin and destination in the same pipeline when reading from and writing to different regions.
  • The Field Renamer processor does not rename fields for empty batches.
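
The force stop conditions listed for EMR pipelines in the first fixed issue above can be summarized as a single predicate. This is a hedged Python sketch of the decision as these notes describe it, not Transformer code. The function and parameter names are hypothetical, and the EMR version is modeled as a tuple of integers for illustration.

```python
def can_force_stop(yarn_reachable: bool, emr_version: tuple,
                   step_concurrency: int, step_available: bool) -> bool:
    """Return True when a force stop can reach the Spark job on EMR.

    Mirrors the notes: Transformer first tries the YARN service. If YARN is
    not reachable, a stop step only works when all three conditions hold.
    """
    if yarn_reachable:
        return True
    return (
        emr_version >= (5, 28)       # EMR 5.28 or later, with step concurrency support
        and step_concurrency >= 2    # Step Concurrency property set to 2 or higher
        and step_available           # a step becomes available
    )


print(can_force_stop(True, (5, 20), 1, False))   # YARN reachable: stop via YARN
print(can_force_stop(False, (5, 28), 2, True))   # all fallback conditions met
print(can_force_stop(False, (6, 1), 1, True))    # Step Concurrency too low
```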

4.0.x Known Issues

  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, and the Spark job completes before a failover Transformer fully takes over the Control Hub job, the Control Hub job can hang in the failover Transformer in a STARTING state with the following error:

    CONTAINER_0102 - Cannot change state from STARTING to FINISHING

    Workaround: To correctly finish the Control Hub job, use Control Hub to force stop the job and wait until the job reaches an INACTIVE_ERROR state. Then, acknowledge the error.

  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.