Overview
Transformer pipelines run on Spark. Generally, you run pipelines on Spark deployed on a cluster to leverage the performance and scale that Spark offers. However, when needed, you can run a local pipeline on the Transformer machine.
When running a pipeline on a cluster, Transformer submits the pipeline as a Spark application to the cluster. Spark distributes the processing across the nodes in the cluster.
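Transformer generates and submits the Spark application for you, so you never write this code yourself. As a rough illustration of how a Spark application distributes processing across the cluster, the following minimal Scala sketch reads a dataset and lets Spark process its partitions in parallel on the executors; the input path and column name are hypothetical placeholders, not anything Transformer produces.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MinimalSparkApp {
  def main(args: Array[String]): Unit = {
    // The cluster manager decides where the executors run;
    // the application code itself stays the same.
    val spark = SparkSession.builder()
      .appName("minimal-distributed-example")
      .getOrCreate()

    // Spark splits the read into partitions and processes them in
    // parallel across the executor nodes of the cluster.
    // "/data/orders" and the "status" column are hypothetical.
    val orders = spark.read.parquet("/data/orders")

    val counts = orders
      .groupBy(col("status"))
      .count()

    counts.show()
    spark.stop()
  }
}
```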
You specify the cluster that runs a pipeline on the Cluster tab of the pipeline properties, and then you configure the related cluster properties. Transformer can run pipelines on the following cluster types:
- Amazon EMR
- Amazon EMR Serverless
- Apache Spark for Azure HDInsight
- Cloudera Data Engineering
- Databricks
- Google Dataproc
- Hadoop YARN
- SQL Server 2019 Big Data Cluster
- Spark Standalone - Spark Standalone clusters are supported for development workloads only.
For more information about supported versions and distributions, see the Installation chapter.
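Transformer derives the cluster connection details from the Cluster tab configuration, so you do not set them in pipeline code. As a point of reference only, the sketch below shows how a plain Spark application targets different cluster managers through the master setting; the standalone master host and port are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: Transformer configures the cluster connection from the
// Cluster tab, so the master URL is never set by hand in pipeline code.
object MasterUrlExamples {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-selection-example")
      // Hadoop YARN clusters:
      // .master("yarn")
      // Spark Standalone (development workloads only); host and port are placeholders:
      // .master("spark://standalone-master.example.com:7077")
      // Local pipeline on the Transformer machine:
      .master("local[*]")
      .getOrCreate()

    println(s"Running against: ${spark.sparkContext.master}")
    spark.stop()
  }
}
```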
Spark Job Reporting
When you run a pipeline, Transformer submits the Spark application to the cluster as a Spark job. You can view information about the Spark job in your cluster reporting as well as in Control Hub. Note that the two can report different results for the same job. For example, when pipeline processing is complete, Transformer might stop the Spark job before it finishes gracefully. The cluster then reports that the job failed, but Control Hub correctly reports that the job completed successfully.
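For reference, Spark exposes job and application lifecycle events through its public listener API. The following sketch is not part of Transformer; it is only a minimal illustration of observing those events from within a Spark application, and it uses a local master solely to keep the example self-contained.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerJobEnd}
import org.apache.spark.sql.SparkSession

// Illustrative only: registers a standard Spark listener that logs
// job and application end events.
object JobStatusListenerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-status-listener-example")
      .master("local[*]") // local master used only to keep the sketch runnable
      .getOrCreate()

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} finished with result: ${jobEnd.jobResult}")

      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
        println(s"Application ended at ${end.time}")
    })

    // Trigger a trivial job so the listener has something to report.
    spark.range(0, 1000).count()
    spark.stop()
  }
}
```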