Pipelines
Use the following tips for help with pipeline errors:
- A pipeline fails to start with the following error:

  org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
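Spark raises this error when dynamic allocation is enabled but the external shuffle service is not. As a hedged sketch, either of the following Spark properties addresses it, depending on whether your cluster can run the shuffle service; where you set these properties varies by cluster manager:

```properties
# Option 1: enable the external shuffle service that the error asks for.
spark.shuffle.service.enabled=true

# Option 2 (alternative): disable dynamic allocation instead.
# spark.dynamicAllocation.enabled=false
```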
- A pipeline fails to start with the following error:

  TRANSFORMER_03 Databricks shared cluster <name> already contains StreamSets libraries from a different staging directory <directory>. Either replace 'Pipeline Config > Staging Directory' config value to <directory> or uninstall all libraries from the shared cluster and restart the cluster.

  This error occurs when you try to run a pipeline on an existing Databricks cluster that has previously run pipelines built on a different version of Transformer. This is not allowed.
- A pipeline run, validation, or preview fails with errors.

  The exact error message varies depending on the cluster that runs the pipeline.
- A pipeline fails with the following run error:

  org.apache.spark.sql.AnalysisException: Found duplicate column(s)
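Spark typically raises this when two inputs to a join or union carry columns with the same name, so the usual fix is to rename or drop the clashing columns before the offending stage. The renaming logic can be sketched in plain Python; the `make_unique` helper below is hypothetical, not part of any Spark or Transformer API:

```python
def make_unique(columns):
    """Suffix repeated column names so each name appears only once."""
    seen = {}
    result = []
    for name in columns:
        if name in seen:
            seen[name] += 1
            result.append(f"{name}_{seen[name]}")
        else:
            seen[name] = 0
            result.append(name)
    return result

# Example: a join whose two inputs both carry an "id" column.
print(make_unique(["id", "name", "id"]))  # → ['id', 'name', 'id_1']
```

In a pipeline, the same effect is usually achieved by renaming fields in a processor upstream of the join so that every column name is unique.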
- Pipeline validation fails with the following stage library/cluster manager mismatch error:

  VALIDATION_0300 Stage <stage name> using the <stage library name> stage library cannot be used with the <cluster type> cluster manager type
- Pipelines running on a Hadoop YARN cluster remain indefinitely in a running or stopping status.

  When you run pipelines on a Hadoop YARN cluster, the Spark submit process continues to run until the pipeline finishes, consuming memory on the Transformer machine. When the Transformer machine has limited memory, or when a large number of pipelines start on a single Transformer, this memory usage can cause pipelines to remain indefinitely in a running or stopping status.

  To avoid this issue, run the following command on each Transformer machine to limit the amount of memory available to each Spark submit process:

  export SPARK_SUBMIT_OPTS="-Xmx64m"