Pipelines

Use the following tips for help with pipeline errors:
A pipeline fails to start with the following error:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled. 
This error occurs when the Spark cluster that runs the pipeline does not have the Spark external shuffle service enabled. For more information, see Spark Shuffle Service Requirement.
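For example, on a cluster where you manage Spark configuration through spark-defaults.conf, enabling dynamic allocation together with the external shuffle service typically looks like the following sketch. The exact setup depends on your Spark distribution; on YARN, for instance, the shuffle service must also be installed on each node manager:
spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true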
A pipeline fails to start with the following error:
TRANSFORMER_03 Databricks shared cluster <name> already contains StreamSets libraries from a different staging directory <directory>. Either replace 'Pipeline Config > Staging Directory' config value to <directory> or uninstall all libraries from the shared cluster and restart the cluster.
This error occurs when you try to run a pipeline on an existing Databricks cluster that has previously run pipelines built on a different version of Transformer. A cluster cannot run pipelines built on different Transformer versions.
There are several possible solutions to this issue:
  • If you still need to run pipelines built on the other Transformer version on the cluster, run this pipeline on a different cluster. To do this, you can either:
    • Specify a different existing cluster to run the pipeline, one that has not run Transformer pipelines or has only run pipelines built on the same version of Transformer.
    • Configure the pipeline to provision a cluster. Then, Databricks spins up a new cluster to run the pipeline, bypassing any version issues.
  • If you no longer need to run pipelines built on the other version on the cluster, uninstall the Transformer libraries from the cluster and restart it, as sketched below. Then, you can run this pipeline on the cluster.
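For example, if you use the legacy Databricks CLI, the cleanup might look like the following sketch. The cluster ID and library path are placeholders, and libraries marked for uninstall are removed only after the cluster restarts. You can also uninstall libraries from the Databricks UI:
databricks libraries cluster-status --cluster-id <cluster-id>
databricks libraries uninstall --cluster-id <cluster-id> --jar dbfs:/<staging-directory>/<library>.jar
databricks clusters restart --cluster-id <cluster-id>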
A pipeline preview, validation, or run fails with the following error:
TRANSFORMER_02 Failed to <preview, validate, or run> pipeline, check logs for error. The Transformer Spark application in the Spark cluster might not be able to reach Transformer at <URL>. If this is not the correct URL, update the transformer.base.http.url property in the Transformer configuration file or define a cluster callback URL for the pipeline and restart Transformer.
This error occurs when Spark cannot communicate with Transformer using the URL configured in the Transformer configuration properties.
To resolve this issue, verify that the transformer.base.http.url property is configured with the correct URL and verify that the Spark cluster has access to Transformer at this URL. For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
If the issue persists, you might need to define a cluster callback URL for the pipeline, which overrides the Transformer URL that Spark uses for that pipeline. Define a cluster callback URL when the web browser and the Spark cluster must use different URLs to access Transformer. For more information, see Cluster Callback URL.
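For example, assuming a default installation where the Transformer configuration file is transformer.properties and Transformer listens on the default port, the property might look like this, with a placeholder host name that the Spark cluster can resolve:
transformer.base.http.url=http://transformer.example.com:19630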
A pipeline fails with the following run error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) 
The Join processor generates this error when records from both inputs contain the same field names and those fields are not used as matching fields.
To resolve this issue, select the Add Prefix to Field Names property and specify a prefix for either the left input or the right input.
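For example, assuming both inputs contain hypothetical id and total fields and you specify the prefix L_ for the left input, the joined records contain L_id and L_total from the left input alongside id and total from the right input, so the field names no longer collide.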
Pipeline validation fails with the following stage library/cluster manager mismatch error:
VALIDATION_0300 Stage <stage name> using the <stage library name> stage library cannot be used with the <cluster type> cluster manager type
The stage library selected for a stage must be valid for the cluster manager type configured for the pipeline.
For example, if the pipeline uses no cluster manager type because you want it to run locally, stages with a Stage Library property cannot be configured to use a cluster-provided library.
Select a valid stage library for the stage or change the pipeline cluster manager type, as appropriate.
Pipelines running on a Hadoop YARN cluster indefinitely remain in a running or stopping status.

When you run pipelines on a Hadoop YARN cluster, the Spark submit process continues to run until the pipeline finishes, which consumes memory on the Transformer machine. When the Transformer machine has limited memory or when a large number of pipelines start on a single Transformer, this memory usage can cause pipelines to remain indefinitely in a running or stopping status.

To avoid this issue, run the following command on each Transformer machine to limit the amount of memory available to each Spark submit process:
export SPARK_SUBMIT_OPTS="-Xmx64m"
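To make the setting persist across Transformer restarts, you can also add the export to the environment file that your installation sources at startup. For example, in a tarball installation this is typically libexec/transformer-env.sh; the path below is an assumption, so adjust it for your installation:
echo 'export SPARK_SUBMIT_OPTS="-Xmx64m"' >> <transformer-dist>/libexec/transformer-env.sh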