Hadoop YARN

You can run Transformer pipelines using Spark deployed on a Hadoop YARN cluster. Transformer supports several distributions of Hadoop YARN. For a complete list, see the Cluster Compatibility Matrix.

To run a pipeline on a Hadoop YARN cluster, configure the pipeline to use Hadoop YARN as the cluster manager type on the Cluster tab of pipeline properties.

Important: The Hadoop YARN cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in Granting the Spark Cluster Access to Transformer in the installation instructions.
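
One way to confirm that a cluster node can reach the Transformer URL is a basic TCP connectivity check run from that node. The following is a minimal sketch and not part of Transformer: the function name can_reach is hypothetical, and the check only confirms that the host and port accept connections, not that Transformer itself is responding correctly.

```python
import socket
from urllib.parse import urlparse


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; it does not call any Transformer API.
    """
    parsed = urlparse(url)
    if parsed.hostname is None:
        raise ValueError(f"no host in URL: {url!r}")
    # Fall back to the scheme's default port when the URL omits one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, calling can_reach with your Transformer URL from a YARN node returns False when a firewall blocks the port or the host name does not resolve. The URL you pass is whatever is configured for your Transformer installation.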

Before running a pipeline on a Hadoop YARN cluster, ensure all requirements are met. Before running a pipeline on a MapR Hadoop YARN cluster, complete the prerequisite tasks.

When you configure a pipeline to run on a Hadoop YARN cluster, you configure the deployment mode used for the launched application. By default, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and access files in the Hadoop system. If you enable Transformer to use Kerberos authentication or Hadoop impersonation, you can override the default proxy user that launches the Spark application.
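
To relate these pipeline properties to standard Spark concepts, the sketch below builds the argument list that a generic spark-submit call would use for the same choices. This is an illustration under assumptions, not a description of how Transformer launches applications internally: the function spark_submit_args is hypothetical, while the --master, --deploy-mode, and --proxy-user options are standard spark-submit options.

```python
from typing import List, Optional


def spark_submit_args(deploy_mode: str, proxy_user: Optional[str] = None) -> List[str]:
    """Build spark-submit arguments for a YARN launch (illustrative only)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unsupported deployment mode: {deploy_mode!r}")
    # YARN is the cluster manager; the deployment mode decides where the driver runs.
    args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    if proxy_user:
        # Launch the application as this user instead of the submitting user.
        args += ["--proxy-user", proxy_user]
    return args
```

With no proxy user supplied, the list contains only the cluster manager and deployment mode, which mirrors leaving the user override undefined in the pipeline properties.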

The following image displays a pipeline configured to run on Spark deployed to a Hadoop YARN cluster:

Notice how this pipeline is configured to run in cluster deployment mode. The Hadoop user name is not defined because the pipeline is configured to use Kerberos authentication.