Google Dataproc

You can run Transformer pipelines using Spark deployed on a Google Dataproc cluster. Transformer supports several Dataproc versions. For a complete list, see Cluster Compatibility Matrix.

To run a pipeline on a Dataproc cluster, configure the pipeline to use Dataproc as the cluster manager type on the Cluster tab of the pipeline properties.

Important: The Dataproc cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in Granting the Spark Cluster Access to Transformer in the installation instructions.
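
One quick way to confirm this access before running a pipeline is to test the network path from a cluster node. The following is a minimal sketch; the Transformer URL shown is a placeholder, so substitute your own:

```python
# Hypothetical reachability check, run from a Dataproc cluster node.
# The URL is a placeholder; use your actual Transformer URL.
import requests

TRANSFORMER_URL = "https://transformer.example.com:19630"  # placeholder

try:
    # Any HTTP response, even an error status, proves the network path works;
    # a timeout or connection error means the cluster cannot reach Transformer.
    response = requests.get(TRANSFORMER_URL, timeout=10)
    print(f"Transformer reachable: HTTP {response.status_code}")
except requests.exceptions.RequestException as exc:
    print(f"Transformer not reachable: {exc}")
```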

When you configure a pipeline to run on a Dataproc cluster, you specify the Google Cloud project ID and region, as well as the credentials provider and related properties. You also define a staging URI within Google Cloud to store the StreamSets libraries and resources needed to run the pipeline.
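
Before configuring these properties, it can help to verify that your credentials resolve and that the staging location is accessible. Here is a minimal sketch using the google-auth and google-cloud-storage libraries; the bucket name is hypothetical:

```python
# Minimal sketch: confirm that credentials resolve and that the bucket
# behind the staging URI is reachable before configuring the pipeline.
# The bucket name is hypothetical; replace it with your own.
import google.auth
from google.cloud import storage

STAGING_BUCKET = "my-transformer-staging"  # hypothetical bucket name

# Resolve Application Default Credentials and the associated project.
credentials, project_id = google.auth.default()
print(f"Using project: {project_id}")

# Confirm the staging bucket is accessible with these credentials.
client = storage.Client(project=project_id, credentials=credentials)
bucket = client.bucket(STAGING_BUCKET)
print(f"Staging bucket exists: {bucket.exists()}")
```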

You can specify an existing cluster to use, or you can have Transformer provision a cluster to run the pipeline.

When provisioning a cluster, you specify a cluster prefix and the Dataproc image version to use. You also select the machine types and network type to use. You can optionally define network tags for the provisioned cluster, specify the number of workers to use, and have Dataproc terminate the cluster after the pipeline stops.
Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline. Running multiple pipelines on a single existing cluster can also reduce costs.
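
Transformer performs the provisioning for you, but a rough sketch using the google-cloud-dataproc client library illustrates what these options map to on the Dataproc side. All names and values below are placeholders, and the idle-deletion setting only approximates terminating the cluster after the pipeline stops:

```python
# Rough illustration of the provisioning options using the
# google-cloud-dataproc client. All names and values are placeholders;
# Transformer performs the equivalent work when it provisions a cluster.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"  # placeholder project ID
REGION = "us-central1"     # placeholder region

# The Cluster Controller client must target the regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    # Transformer builds the name from the cluster prefix you specify.
    "cluster_name": "transformer-pipeline-abc123",
    "config": {
        "software_config": {"image_version": "2.1-debian11"},  # Dataproc image version
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        "gce_cluster_config": {"tags": ["transformer"]},  # network tags
        # Idle deletion approximates terminating the cluster after the
        # pipeline stops.
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 1800}},
    },
}

operation = client.create_cluster(
    project_id=PROJECT_ID, region=REGION, cluster=cluster
)
print(f"Created cluster: {operation.result().cluster_name}")
```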

The following image shows the Cluster tab of a pipeline configured to run on a Dataproc cluster.

The following image shows the Dataproc tab configured to run the pipeline on an existing Dataproc cluster.