Installing when Spark Runs on a Cluster

In a production environment, install Transformer on a machine that is configured to submit Spark jobs to a cluster.

All users can install Transformer from a tarball and run it manually. Users with an enterprise account can install Transformer from an RPM package and run it as a service. Installing an RPM package requires root privileges.

When you install from an RPM package, Transformer uses the default directories and runs as the default system user and group. The default system user and group are named transformer. If a transformer user and a transformer group do not exist on the machine, the installation creates the user and group for you and assigns them the next available user ID and group ID.
Tip: To use specific IDs for the transformer user and group, create the user and group before installation and specify the IDs that you want to use. For example, if you're installing Transformer on multiple machines, you might want to create the system user and group before installation to ensure that the user ID and group ID are consistent across the machines.
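For example, the following commands create the transformer system group and user with explicit, matching IDs before installation. This is a minimal sketch; the ID 20169 is a hypothetical value, so substitute the IDs that you want to use:

    # Create the transformer system group and user with explicit IDs
    groupadd -r -g 20169 transformer
    useradd -r -u 20169 -g transformer transformer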

Before you start, ensure that the machine meets the self-managed deployment and general installation requirements, and choose the engine version and installation package that you want to use.

  1. Download the Transformer installation package.

    If using the RPM package, download the appropriate package for your operating system:

    • For CentOS 6, Oracle Linux 6, or Red Hat Enterprise Linux 6, download the RPM EL6 package.
    • For CentOS 7, Oracle Linux 7, or Red Hat Enterprise Linux 7, download the RPM EL7 package.
  2. If you downloaded the tarball, use the following command to extract the tarball to the desired location:
    tar xf streamsets-transformer-all_<scala version>-<transformer version>.tgz -C <extraction directory>

    For example, to extract Transformer version 4.1.0 prebuilt with Scala 2.11.x, use the following command:

    tar xf streamsets-transformer-all_2.11-4.1.0.tgz -C /opt/streamsets-transformer/
  3. If you downloaded the RPM package, complete the following steps to extract and install the package:
    1. Use the following command to extract the package to the desired location:
      tar xf streamsets-transformer-<transformer version>-<operating system>-all-rpms.tar
      For example, to extract Transformer version 4.1.0 on CentOS 7, use the following command:
      tar xf streamsets-transformer-4.1.0-el7-all-rpms.tar
    2. To install the package, use the following command from the directory where you extracted the package:
      yum localinstall streamsets*.rpm
  4. Edit the Transformer configuration file, $TRANSFORMER_CONF/transformer.properties.
    1. Uncomment the transformer.base.http.url property and set it to the Transformer URL. If Transformer is installed on a cloud-computing platform such as Amazon Elastic Compute Cloud (EC2), define the publicly accessible URL of that instance.

      For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

      transformer.base.http.url=http://myhost:19630

      Grant the Spark cluster access to Transformer at this URL. The Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines. For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
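      For example, to confirm that a cluster node can reach Transformer at this URL, you might run a quick check such as the following from a node in the cluster. This sketch reuses the hypothetical myhost URL above; an HTTP status code in the response, rather than a connection error, indicates that the node can reach Transformer:

      # Print only the HTTP status code returned by the Transformer URL
      curl -s -o /dev/null -w "%{http_code}\n" http://myhost:19630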

    2. To run pipelines on a MapR Hadoop YARN cluster, uncomment the hadoop.mapr.cluster property and set it to true.

      Before running a pipeline on a MapR cluster, complete the prerequisite tasks.
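      For example, the uncommented property in transformer.properties looks like this:

      hadoop.mapr.cluster=true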

    3. For an unsecured Hadoop YARN cluster, to restrict Transformer to impersonating only the user who starts the pipeline when submitting Spark jobs, uncomment the hadoop.always.impersonate.current.user property and set it to true.
      On an unsecured Hadoop YARN cluster with Hadoop impersonation enabled, this is a recommended security measure: it prevents a user from impersonating another user by entering a different user name in the Hadoop User Name pipeline property.
      On a secure, Kerberos-enabled cluster, this property is not used. When Kerberos is enabled, Transformer either uses the keytab specified in the pipeline or impersonates the user who starts the pipeline, regardless of how this property is set.
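      For example, the uncommented property looks like this:

      hadoop.always.impersonate.current.user=true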
  5. Add the following environment variables to the Transformer environment configuration file.

    Modify environment variables using the method required by your installation type.

    JAVA_HOME
      Path to the Java installation on the machine.

    SPARK_HOME
      Path to the Spark installation on the machine. Required for Hadoop YARN and Spark standalone clusters only.

      Clusters can include multiple Spark installations. Be sure to point to a supported Spark version that is valid for the Transformer features that you want to use.

      On Cloudera clusters, Spark is generally installed in the parcels directory. For example, for CDH 5.11, you might use /opt/cloudera/parcels/SPARK2/lib/spark2.

      Tip: To verify the version of a Spark installation, run the spark-shell command, which displays the Spark version on startup. Then, use sc.getConf.get("spark.home") to return the installation location.

    HADOOP_CONF_DIR or YARN_CONF_DIR
      Directory that contains the client-side configuration files for the Hadoop cluster. Required for Hadoop YARN and Spark standalone clusters only.

    For more information about these environment variables, see the Apache Spark documentation.
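
    For example, assuming a shell-style environment configuration file, the entries for a machine that submits jobs to a Hadoop YARN cluster might look like the following. The JAVA_HOME and HADOOP_CONF_DIR paths are hypothetical, and the SPARK_HOME path reuses the Cloudera example above; use the locations on your machine:

    # Java installation used to run Transformer
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
    # Spark installation used to submit jobs to the cluster
    export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
    # Client-side Hadoop configuration files for the cluster
    export HADOOP_CONF_DIR=/etc/hadoop/conf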

When you configure a pipeline, you specify cluster details in the pipeline properties. For information about each cluster type, see the Cluster Type chapter.