Hadoop YARN

You can run Transformer pipelines using Spark deployed on a Hadoop YARN cluster. Transformer supports several distributions of Hadoop YARN. For a complete list, see Cluster Compatibility Matrix.

To run a pipeline on a Hadoop YARN cluster, configure the pipeline to use Hadoop YARN as the cluster manager type on the Cluster tab of pipeline properties.
Important: The Hadoop YARN cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. Grant the cluster access to the Transformer URL, as described in the installation instructions.

Before running a pipeline on a Hadoop YARN cluster, ensure all requirements are met. Before running a pipeline on a MapR Hadoop YARN cluster, complete the prerequisite tasks.

When you configure a pipeline to run on a Hadoop YARN cluster, you configure the deployment mode used for the launched application. By default, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and access files in the Hadoop system. If you enable Transformer to use Kerberos authentication or Hadoop impersonation, you can override the default proxy user that launches the Spark application.

The following image displays a pipeline configured to run on Spark deployed to a Hadoop YARN cluster:

Notice how this pipeline is configured to run in cluster deployment mode. The Hadoop user name is not defined because the pipeline is configured to use Kerberos authentication.

Deployment Mode

Cluster pipelines on Hadoop YARN can use one of the following deployment modes:

Client

In client deployment mode, the Spark driver program is launched on the Transformer machine, outside of the cluster. Use client mode when the Transformer machine is physically co-located with the cluster worker machines.

Cluster
In cluster deployment mode, the Spark driver program is launched remotely on one of the worker nodes inside the cluster. Use cluster mode when the Transformer machine is physically located far from the worker machines. In this case, cluster mode minimizes network latency between the driver and the executors.
Note: Spark uses a YARN container for the driver for each pipeline.

For more information about deployment modes, see the Apache Spark documentation.
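
As a rough illustration of what the two modes mean at the Spark level, the following sketch prints the spark-submit command line that corresponds to each mode. Transformer launches the Spark application itself, so you would not normally run this by hand. The build_submit_command helper and the app.jar name are hypothetical; --master yarn and --deploy-mode are standard spark-submit options.

```shell
# Hypothetical helper: prints (does not run) the spark-submit command line
# for the given deployment mode. app.jar is a placeholder application name.
build_submit_command() {
  case "$1" in
    client|cluster) ;;
    *) echo "unsupported deployment mode: $1" >&2; return 1 ;;
  esac
  echo "spark-submit --master yarn --deploy-mode $1 app.jar"
}

# Driver runs on the submitting machine.
build_submit_command client
# Driver runs in a YARN container on a worker node.
build_submit_command cluster
```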

Transformer Proxy Users

For Transformer pipelines to run as expected, all Transformer proxy users must have permissions on the required directories.

By default, Transformer uses the user who starts a pipeline as a proxy user to launch the Spark application and to access files in the Hadoop system.

When the Hadoop YARN cluster uses Kerberos authentication, you must either enable proxy users for Kerberos in the Transformer installation, or configure individual pipelines to use a Kerberos principal and keytab, which overrides the default proxy user.

When Transformer uses Hadoop impersonation without Kerberos authentication, you can configure a Hadoop user in individual pipelines to override the default proxy user, the user who starts the pipeline.

You can use a Transformer configuration property to prevent overriding the proxy user. This option is highly recommended. It ensures that the user who starts the pipeline is always used as the proxy user and prevents users from entering a different user name in pipeline properties.

Kerberos Authentication

When the Hadoop YARN cluster uses Kerberos authentication, Transformer uses the user who starts the pipeline as the proxy user to launch the Spark application and to access files in the Hadoop system, unless you configure a Kerberos principal and keytab for the pipeline.

Using a Kerberos principal and keytab enables Spark to renew Kerberos tokens as needed, and is strongly recommended.

For example, you should configure a Kerberos principal and keytab for long-running pipelines, such as streaming pipelines, so that the Kerberos token can be renewed by Spark. If Transformer uses a proxy user for a pipeline that runs for longer than the maximum lifetime of the Kerberos token, the Kerberos token expires and the proxy user cannot be authenticated.

Note: If you choose to use proxy users when the cluster uses Kerberos authentication, you first must enable proxy users for Kerberos in the Transformer installation.

For more information about submitting Spark applications to Hadoop clusters that use Kerberos authentication, see the Apache Spark documentation.

Using a Keytab for Each Pipeline

Configure pipelines to use a Kerberos keytab and specify the source of the keytab. When you do not specify a keytab source, Transformer uses the user who starts the pipeline to launch the Spark application and to access files in the Hadoop system.

When using a keytab, Transformer uses the Kerberos principal to launch the Spark application and to access files in the Hadoop system. Transformer also includes the keytab file with the launched Spark application so that the Kerberos token can be renewed by Spark.

When you enable a pipeline to use a keytab, you configure one of the following keytab sources for the pipeline:
Transformer configuration file
The pipeline uses the same Kerberos keytab and principal configured for Transformer in the Transformer configuration file.
For information about specifying the Kerberos keytab in the Transformer configuration file, see Enabling the Properties File as the Keytab Source.
Pipeline configuration - file
The pipeline uses the Kerberos keytab file and principal configured for the pipeline. Store the keytab file on the Transformer machine.
In the pipeline properties, you define the absolute path to the keytab file and the Kerberos principal to use for that keytab.
Define a specific keytab and principal for a pipeline to ensure that only authorized users access data stored in HDFS files.
Pipeline configuration - credential store
The pipeline uses the Kerberos keytab file and principal configured for the pipeline. Add the Base64-encoded keytab to a credential store, and then use a credential function to retrieve the keytab from the credential store.
Note: Be sure to remove unnecessary characters, such as newline characters, before encoding the keytab.
In the pipeline properties, you use the credential:get() or credential:getWithOptions() credential function to retrieve the keytab, and you define the Kerberos principal to use for that keytab.
For more information about using credential stores with Transformer, see Credential Stores.
Define a specific keytab and principal for a pipeline to ensure that only authorized users access data stored in HDFS files. When using a credential store, you can also require group access to credential store secrets for an additional layer of security.
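
A minimal sketch of preparing the Base64 value for the credential store, assuming the goal of the note above is a single-line value with no embedded line breaks. The keytab here is a throwaway stand-in file and the variable names are hypothetical. The sketch also assumes a base64 command that accepts -d for decoding, as GNU coreutils does. How you upload the value depends on your credential store and is not shown.

```shell
# Stand-in for a real keytab so the sketch runs anywhere.
keytab_file="$(mktemp)"
printf 'not-a-real-keytab' > "$keytab_file"

# Encode the file, then strip any line wraps that base64 adds,
# so the secret is stored as one continuous line.
encoded_keytab="$(base64 < "$keytab_file" | tr -d '\n')"

# Round-trip check: decoding must reproduce the original bytes.
decoded_file="$(mktemp)"
printf '%s' "$encoded_keytab" | base64 -d > "$decoded_file"
cmp -s "$keytab_file" "$decoded_file" && echo "round trip ok"
```

Checking the round trip before storing the secret helps catch a corrupted or truncated value early, before a pipeline fails to authenticate.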

Hadoop Impersonation Mode

When the Hadoop YARN cluster is configured for impersonation but not for Kerberos authentication, you can configure the Hadoop impersonation mode that Transformer uses when performing tasks in the Hadoop system.

When not using Kerberos, Transformer impersonates Hadoop users as follows:
  • As the user defined in the pipeline properties - When configured, Transformer uses the specified Hadoop user to launch the Spark application and to access files in the Hadoop system.
  • As the currently logged in Transformer user who starts the pipeline - When no Hadoop user is defined in the pipeline properties, Transformer uses the user who starts the pipeline.
Important: When Kerberos authentication is enabled, Transformer impersonates Hadoop users as the Transformer user who starts the pipeline, or runs directly as the Kerberos principal defined for the pipeline. When Kerberos is enabled, Transformer ignores the Hadoop user defined in the pipeline properties.

The system administrator can configure Transformer to always use the user who starts the pipeline by enabling the hadoop.always.impersonate.current.user property in the Transformer configuration file. When enabled, configuring a Hadoop user within a pipeline is not allowed.

Configure Transformer to always impersonate the user who starts the pipeline when you want to prevent the pipeline-level Hadoop user property from being used to access data in Hadoop systems as a different user.

For example, say you use roles, groups, and pipeline permissions to ensure that only authorized operators can start pipelines. You expect that the operator user accounts are used to access all external systems. But a pipeline developer can specify an HDFS user in a pipeline and bypass your attempts at security. To close this loophole, configure Transformer to always use the user who starts the pipeline to read from or write to Hadoop systems.

To always use the user who starts the pipeline, in the Transformer configuration file, uncomment the hadoop.always.impersonate.current.user property and set it to true.
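
The edit can also be scripted. The following sketch assumes the property appears in the file as a commented-out line of the form shown, and it works on a temporary stand-in file rather than the real Transformer configuration file, whose location depends on your installation. The variable names are hypothetical.

```shell
# Stand-in for the Transformer configuration file (assumed commented-out form).
conf_file="$(mktemp)"
printf '%s\n' '#hadoop.always.impersonate.current.user=false' > "$conf_file"

# Uncomment the property, if needed, and set it to true.
# -i.bak keeps a backup copy next to the edited file.
sed -i.bak \
  's/^#*[[:space:]]*\(hadoop\.always\.impersonate\.current\.user\)=.*/\1=true/' \
  "$conf_file"

# Show the resulting line.
grep '^hadoop\.always\.impersonate\.current\.user=' "$conf_file"
```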

Lowercasing User Names

When Transformer impersonates Hadoop users to perform tasks in Hadoop systems, you can also configure Transformer to lowercase all user names before passing them to Hadoop.

When the Hadoop system is case sensitive and its user names are lowercase, you can use this property to lowercase mixed-case user names before they are passed to Hadoop.

To lowercase user names before passing them to Hadoop, in the Transformer configuration file, uncomment the hadoop.always.lowercase.user property and set it to true.
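
To illustrate the effect only, the following sketch lowercases a hypothetical mixed-case user name in the shell. It is not how Transformer implements the property.

```shell
# Hypothetical mixed-case user name, for illustration only.
transformer_user="JSmith"

# Lowercase the name, as Hadoop would receive it with the property enabled.
hadoop_user="$(printf '%s' "$transformer_user" | tr '[:upper:]' '[:lower:]')"
echo "$hadoop_user"   # prints "jsmith"
```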

Working with HDFS Encryption Zones

Hadoop systems use the Hadoop Key Management Server (KMS) to obtain encryption keys. To enable access to HDFS encryption zones while using proxy users, configure KMS to allow the same user impersonation as you have configured for HDFS.

To allow Transformer as a proxy user, add the following properties to the KMS configuration file and configure the values for the properties:
  • hadoop.kms.proxyuser.<user>.groups
  • hadoop.kms.proxyuser.<user>.hosts

Where <user> is either the Hadoop user defined in the Hadoop User Name pipeline property, or the user who started Transformer if a Hadoop user is not defined.

For example, with tx as the user specified in the Hadoop User Name pipeline property, the following properties allow users in the Ops group access to the encryption zones:

<property>
  <name>hadoop.kms.proxyuser.tx.groups</name>
  <value>Ops</value>
</property>
<property>
  <name>hadoop.kms.proxyuser.tx.hosts</name>
  <value>*</value>
</property>

Note that the asterisk (*) indicates no restrictions.

For more information about configuring KMS proxyusers, see the KMS documentation for the Hadoop distribution that you are using. For example, for Apache Hadoop, see KMS Proxyuser Configuration.

MapR Prerequisites

Running a pipeline on a MapR Hadoop YARN cluster requires performing some prerequisite tasks.
Note: MapR is now HPE Ezmeral Data Fabric. This documentation uses "MapR" to refer to both MapR and HPE Ezmeral Data Fabric.
Important: Be sure that you also configured the required MapR property when you installed Transformer.

Perform the following tasks, as needed:

Spark Dynamic Allocation Prerequisite

Before you run a pipeline on a MapR cluster, you must set up Spark dynamic allocation on the cluster.

HPE Developer provides a blog post that describes how to perform this task. Perform all of the steps described in the post, with the following change.

At this time, the "Enabling Dynamic Allocation in Apache Spark" section says to add the following entries to the /opt/mapr/spark/spark-1.6.1/conf/spark-defaults.conf file:

spark.dynamicAllocation.enabled = true
spark.shuffle.service.enabled = true
spark.dynamicAllocation.minExecutors = 5
spark.executor.instances = 0

Setting spark.executor.instances to 0 generates an error. Instead, set spark.executor.instances to 1 or higher, up to the maximum number of executors allowed in the Transformer instance.

Hadoop Impersonation Prerequisites

Transformer can impersonate a Hadoop user defined in the pipeline to launch the Spark application and to access services in the Hadoop system. To do this, you configure the Hadoop User Name property on the Cluster tab of the pipeline properties.

To enable impersonating a Hadoop user defined in the pipeline, you must complete some prerequisite tasks on the MapR cluster. When impersonating the user who starts the pipeline, these tasks are unnecessary.

The tasks to perform differ depending on whether the cluster is secured.
Secure clusters
A secure cluster requires username-password or Kerberos authentication. Complete the following tasks to enable impersonating a Hadoop user defined in a pipeline:
  1. Generate impersonation tickets.

    Create impersonation tickets on the MapR node where Transformer is installed. When you create the tickets, you specify the location to store them.

  2. Set the MAPR_TICKET_LOCATION environment variable.

    On the MapR node where Transformer is installed, set the MAPR_TICKET_LOCATION environment variable to the location where the impersonation tickets are stored.

    For example:
    export MAPR_TICKET_LOCATION=/var/tmp/imp-tickets
For details on performing these tasks, see the MapR documentation.
Non-secure clusters
A non-secure cluster does not require authentication for access. Complete the following tasks to enable impersonating a Hadoop user defined in a pipeline:
  1. Set the MAPR_IMPERSONATION_ENABLED environment variable.

    On the MapR node where Transformer is installed, set the MAPR_IMPERSONATION_ENABLED environment variable to true.

  2. Create a proxy file.
    Create a proxy file in the following location:
    /opt/mapr/conf/proxy/
For details on performing these tasks, see the MapR documentation.
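
Mirroring the secure-cluster example, step 1 for a non-secure cluster might look like the following in the shell environment that starts Transformer. Where you persist the variable depends on your installation and is not shown here.

```shell
# Enable MapR impersonation for processes started from this shell.
export MAPR_IMPERSONATION_ENABLED=true

# Confirm the value that child processes will inherit.
echo "MAPR_IMPERSONATION_ENABLED=$MAPR_IMPERSONATION_ENABLED"
```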

Pipeline Start Prerequisite

Before starting a pipeline that runs on a secured MapR cluster, you must log in to the MapR node where Transformer is installed. Logging in generates a ticket that is valid for a specified duration.

You can then start Transformer pipelines for the duration of the generated ticket.

Use the maprlogin command to log in to MapR. The argument that you use depends on the authentication type:
  • Use the password argument for username-password authentication.
  • Use the kerberos argument for Kerberos authentication.

For more information about the maprlogin command, see the MapR documentation.
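
The choice of argument can be sketched as follows. The login_argument_for helper is hypothetical. The sketch runs maprlogin only when the command is present, and otherwise prints the command it would run.

```shell
# Hypothetical helper: maps the authentication type to the maprlogin argument.
login_argument_for() {
  case "$1" in
    username-password) echo "password" ;;
    kerberos)          echo "kerberos" ;;
    *) echo "unknown authentication type: $1" >&2; return 1 ;;
  esac
}

login_arg="$(login_argument_for kerberos)"

# Run maprlogin on the MapR node; elsewhere, just show the command.
if command -v maprlogin >/dev/null 2>&1; then
  maprlogin "$login_arg"
else
  echo "would run: maprlogin $login_arg"
fi
```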