Overview
Transformer pipelines run on Spark. Typically, you run pipelines on Spark deployed on a cluster to leverage the performance and scale that Spark offers. However, when needed, you can run a local pipeline on the Transformer machine.
When running a pipeline on a cluster, Transformer submits the pipeline as a Spark application to the cluster. Spark distributes the processing across the nodes in the cluster.
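Conceptually, the submitted application resembles any other Spark job: a driver builds a SparkSession against a master URL and the cluster executes the distributed work. The minimal Scala sketch below illustrates that model only; the object name, application name, and master URL are illustrative assumptions, not Transformer internals, and a real Transformer pipeline derives these settings from the pipeline's cluster configuration rather than hard-coding them.

```scala
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical settings for illustration. A local pipeline uses a
    // local master; a cluster pipeline would use a URL such as "yarn".
    val spark = SparkSession.builder()
      .appName("example-pipeline")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for pipeline processing: generate data, transform, output.
    val df = spark.range(100).toDF("id")
    df.selectExpr("id", "id * 2 AS doubled").show()

    spark.stop()
  }
}
```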
You select the cluster to run a pipeline on the Cluster tab of the pipeline properties, and then configure the related cluster properties. Transformer can run pipelines on the following cluster types:
- Amazon EMR
- Apache Spark for Azure HDInsight
- Databricks
- Google Dataproc
- Hadoop YARN
- SQL Server 2019 Big Data Cluster
- Spark Standalone - supported for development workloads only
For more information about supported versions and distributions, see the Installation chapter.