Pipelines

Use the following tips for help with pipeline errors:
A pipeline fails to start with the following error:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled. 
This error occurs when the Spark cluster that runs the pipeline does not have the Spark external shuffle service enabled. The external shuffle service is required for Transformer version 3.14.x and later. For more information, see Spark Shuffle Service Requirement.
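How you enable the service depends on the cluster. As a minimal sketch, on a cluster that reads Spark properties from spark-defaults.conf, the relevant settings are typically:
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
On Hadoop YARN, the external shuffle service must also run as an auxiliary service on each NodeManager; see the Spark documentation for your cluster manager for the exact setup.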
A pipeline fails to start with the following error:
TRANSFORMER_03 Databricks shared cluster <name> already contains StreamSets libraries from a different staging directory <directory>. Either replace 'Pipeline Config > Staging Directory' config value to <directory> or uninstall all libraries from the shared cluster and restart the cluster.
This error occurs when you try to run a pipeline on an existing Databricks cluster that has previously run pipelines built on a different version of Transformer. This is not allowed.
There are several possible solutions to this issue:
  • If you still need the cluster to run pipelines built on the other version of Transformer, run this pipeline on a different cluster. To do this, you can either:
    • Specify a different existing cluster to run the pipeline, one that has never run Transformer pipelines or has only run pipelines built on the same version of Transformer.
    • Configure the pipeline to provision a cluster. Databricks then spins up a new cluster to run the pipeline, bypassing any version issues.
  • If you no longer need to run pipelines of the other version on the cluster, uninstall the Transformer libraries and restart the cluster, for example with the Databricks CLI as sketched below. Then, you can run this pipeline on the cluster.
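For example, if you use the legacy Databricks CLI, the cleanup might look like the following sketch. The cluster ID and JAR path are placeholders; list the libraries installed on the cluster first to identify the actual StreamSets JARs:
databricks libraries cluster-status --cluster-id <cluster-id>
databricks libraries uninstall --cluster-id <cluster-id> --jar "dbfs:/<staging-directory>/<streamsets-library>.jar"
databricks clusters restart --cluster-id <cluster-id>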
A pipeline run, validation, or preview fails with errors
You might see a message such as the following; the exact message can differ depending on the cluster that runs the pipeline:
TRANSFORMER_02 Failed to <preview, validate, or run> pipeline, check logs for error. The Transformer Spark application in the Spark cluster might not be able to reach Transformer at <URL>. If this is not the correct URL, update the transformer.driver.callback.url or transformer.base.http.url property in the Transformer configuration file or define a cluster callback URL for the pipeline and restart Transformer.
This problem occurs when Spark cannot communicate with Transformer. One possible cause is an incorrect cluster callback URL.
Check the Transformer log or the driver log on the cluster for related information.
If the problem is with a missing or invalid cluster callback URL, ensure that at least one of the cluster callback URL properties is correctly defined. For more information, see Cluster Callback URL Properties.
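As a minimal sketch, assuming Transformer runs at a host and port that the cluster nodes can reach, you might define the property in the transformer.properties configuration file as follows, where the host is a placeholder and 19630 is the default Transformer port:
transformer.base.http.url=http://<transformer-host>:19630
After updating the configuration file, restart Transformer for the change to take effect.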
A pipeline fails with the following run error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) 
The Join processor generates this error when records from both inputs contain the same field names and those fields are not used as matching fields.
To resolve this issue, select the Add Prefix to Field Names property and specify a prefix for the left input, the right input, or both.
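The underlying Spark behavior and the effect of the prefix can be reproduced outside Transformer. The following is a minimal Scala sketch, runnable in spark-shell, using hypothetical inputs that both carry a name field:
// Both inputs carry a non-matching field named "name".
val left  = Seq((1, "a")).toDF("id", "name")
val right = Seq((1, "b")).toDF("id", "name")

// Joining only on "id" leaves two columns named "name" in the output;
// later operations, such as writing the result, can fail with
// org.apache.spark.sql.AnalysisException: Found duplicate column(s)
val joined = left.join(right, Seq("id"))

// Adding a prefix renames the non-matching fields of one input,
// which is the effect of the Add Prefix to Field Names property:
val prefixed = right.columns.foldLeft(right) { (df, c) =>
  if (c == "id") df else df.withColumnRenamed(c, "r_" + c)
}
val resolved = left.join(prefixed, Seq("id"))  // columns: id, name, r_name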
Pipeline validation fails with the following stage library/cluster manager mismatch error:
VALIDATION_0300 Stage <stage name> using the <stage library name> stage library cannot be used with the <cluster type> cluster manager type
The stage library selected for a stage must be valid for the cluster manager type configured for the pipeline.
For example, if the pipeline has no cluster manager type because it runs locally, stages with a Stage Library property cannot be configured to use a cluster-provided library.
Select a valid stage library for the stage or change the pipeline cluster manager type, as appropriate.
Pipelines running on a Hadoop YARN cluster remain indefinitely in a running or stopping status
When you run pipelines on a Hadoop YARN cluster, the Spark submit process continues to run until the pipeline finishes, which uses memory on the Transformer machine. This memory usage can cause pipelines to remain in a running or stopping status indefinitely when the Transformer machine has limited memory or when a large number of pipelines start on a single Transformer.
To avoid this issue, run the following command on each Transformer machine to limit the amount of memory available to each Spark submit process:
export SPARK_SUBMIT_OPTS="-Xmx64m"
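The -Xmx64m option caps the Java heap of each Spark submit process at 64 MB, reducing the per-pipeline memory overhead on the Transformer machine. To make the setting persist across restarts, you might export it in the environment that launches Transformer, for example, assuming the default bin/streamsets transformer launch command:
export SPARK_SUBMIT_OPTS="-Xmx64m"
bin/streamsets transformer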