Overview
Transformer pipelines run on Spark. Generally, you run pipelines on Spark deployed on a cluster to leverage the performance and scale that Spark offers. However, when needed, you can run a local pipeline on the Transformer machine.
When you run a pipeline on a cluster, Transformer submits the pipeline to the cluster as a Spark application, and Spark distributes the processing across the cluster nodes.
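Transformer handles this submission for you. Purely as an illustration of what a programmatic Spark submission involves, and not of Transformer's internal code, the sketch below uses Spark's launcher API; the application path, class name, and settings shown are hypothetical:

```scala
import org.apache.spark.launcher.SparkLauncher

// Illustrative sketch only; Transformer performs the equivalent submission internally.
// The application path, main class, and settings below are hypothetical.
val handle = new SparkLauncher()
  .setAppName("my-transformer-pipeline")        // name reported to the cluster manager
  .setAppResource("/path/to/pipeline-app.jar")  // packaged Spark application
  .setMainClass("com.example.PipelineDriver")   // hypothetical driver class
  .setMaster("yarn")                            // cluster manager, for example Hadoop YARN
  .setDeployMode("cluster")                     // run the driver on the cluster, not locally
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "4g") // example cluster-related property
  .startApplication()                           // submits asynchronously and returns a handle
```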
You specify the cluster to use on the Cluster tab of the pipeline properties, and then configure the related cluster properties. Transformer can run pipelines on the following cluster types:
- Amazon EMR
- Amazon EMR Serverless
- Cloudera Data Engineering
- Databricks
- Google Dataproc
- Hadoop YARN
- Spark Standalone - Spark Standalone clusters are supported for development workloads only.
For more information about supported versions and distributions, see the Installation chapter.
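As an illustration of how cluster types differ at the Spark level, rather than of how Transformer configures them internally, clusters such as Hadoop YARN and Spark Standalone, as well as local pipelines, correspond to different Spark master settings; managed services such as Databricks are typically addressed through their own job submission APIs rather than a plain master URL. Host names in the sketch are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: Transformer derives the equivalent settings
// from the Cluster tab of the pipeline properties.
val spark = SparkSession.builder()
  .appName("example-pipeline")
  .master("yarn")                            // Hadoop YARN cluster
  // .master("spark://standalone-host:7077") // Spark Standalone cluster (hypothetical host)
  // .master("local[*]")                     // local pipeline on the Transformer machine
  .getOrCreate()
```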
Spark Job Reporting
When you run a pipeline, Transformer submits the Spark application to the cluster as a Spark job. You can view information about the Spark job in your cluster reporting as well as in Control Hub reporting. In some cases, the two can report different results for the same job.
For example, when pipeline processing is complete, Transformer might stop the Spark job before it completes gracefully. This causes the cluster to report that the job failed. However, Control Hub correctly reports that the job completed successfully.
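As a rough sketch of why the two reports can disagree, assume a handle obtained from Spark's launcher API, as in the earlier sketch: stopping the application after processing finishes can leave the cluster-side state as KILLED or FAILED rather than FINISHED, and that terminal state is what the cluster then reports.

```scala
import org.apache.spark.launcher.SparkAppHandle

// Illustrative only: `handle` would be a SparkAppHandle such as the one
// returned by the launcher sketch above.
def reportClusterSideStatus(handle: SparkAppHandle): Unit = {
  // Stopping the application once pipeline processing is done...
  handle.stop()

  // ...can leave the cluster-side state as KILLED or FAILED instead of FINISHED,
  // which is the status the cluster manager then shows for the job.
  handle.getState match {
    case SparkAppHandle.State.FINISHED => println("Cluster reports success")
    case SparkAppHandle.State.KILLED | SparkAppHandle.State.FAILED =>
      println("Cluster reports failure")
    case other => println(s"Job still in state: $other")
  }
}
```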