Overview
Transformer pipelines run on Spark. Typically, you run pipelines on Spark deployed on a cluster to leverage the performance and scale that Spark offers. However, when needed, you can run a local pipeline on the Transformer machine.
When running a pipeline on a cluster, Transformer submits the pipeline as a Spark application to the cluster. Spark distributes the processing across the nodes in the cluster.
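Conceptually, the submitted application resembles any other Spark job: a driver builds a SparkSession against a master URL and the cluster executes the distributed work. The minimal Scala sketch below illustrates that model only; the object name, application name, and master URL are illustrative assumptions, not Transformer internals, and a real Transformer pipeline derives these settings from the pipeline's cluster configuration rather than hard-coding them.

```scala
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical settings for illustration. A local pipeline uses a
    // local master; a cluster pipeline would use a URL such as "yarn".
    val spark = SparkSession.builder()
      .appName("example-pipeline")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for pipeline processing: generate data, transform, output.
    val df = spark.range(100).toDF("id")
    df.selectExpr("id", "id * 2 AS doubled").show()

    spark.stop()
  }
}
```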
You select the cluster to run a pipeline on the Cluster tab of the pipeline properties, and then configure the related cluster properties. Transformer can run pipelines on the following cluster types:
- Amazon EMR
- Apache Spark for Azure HDInsight
- Databricks
- Google Dataproc
- Hadoop YARN
- SQL Server 2019 Big Data Cluster
- Spark Standalone - supported for development workloads only
For more information about supported versions and distributions, see the Installation chapter.