Installation Requirements

Choose a Transformer installation package based on the clusters that you want to run pipelines on and the Transformer features that you want to use.

The Scala version that Transformer is built with determines the Java JDK version that must be installed on the Transformer machine and the Spark versions that you can use with Transformer. The Spark version that you choose determines the cluster types and the Transformer features that you can use.

For example, Amazon EMR 6.1 clusters use Spark 3.x. To run Transformer pipelines on those clusters, you use an installation package for Transformer prebuilt with Scala 2.12. And since Transformer prebuilt with Scala 2.12 requires Java JDK 11, you install that JDK version on the Transformer machine.

For more information, see Cluster Compatibility Matrix, Scala, Spark, and Java JDK Requirements, and Spark Versions and Available Features.

Also note the other Transformer requirements in this section.

System Requirements

Install Transformer on a machine that meets the following requirements:

Component | Minimum System Requirement
Cores | 2
Disk space | 6 GB
  Note: StreamSets does not recommend using NFS or NAS to store Transformer files.
File descriptors | 32768
Operating system | One of the following operating systems and versions:
  • CentOS 6.x or 7.x
  • Oracle Linux 6.x - 8.x
  • Red Hat Enterprise Linux 6.x - 8.x
  • Ubuntu 14.04 LTS or 16.04 LTS
RAM | 1 GB

Cluster Compatibility Matrix

The following matrix shows the Transformer Scala version that is required for supported cluster and underlying Spark versions.

You can use this matrix to determine the Transformer installation package to install.

Cluster Type | Supported Cluster Versions | Cluster Underlying Spark Version | Transformer Scala Version
Amazon EMR | 5.20.0 or later 5.x | 2.4.x | Scala 2.11 ⁷
Amazon EMR | 6.1 and later 6.x | 3.x | Scala 2.12
Amazon EMR | 7.x | 3.x | Scala 2.12
Amazon EMR Serverless | 6.9.0 and later 6.x | 3.x | Scala 2.12
Amazon EMR Serverless | 7.x | 3.x | Scala 2.12
Azure HDInsight | 4.0 ⁶ | 2.4.x | Scala 2.11 ⁷
Cloudera Data Engineering | 1.3.x | 2.4.x | Scala 2.11 ⁷
Cloudera Data Engineering | 1.3.3 and later 1.3.x | 3.x | Scala 2.12
Databricks | 5.x - 6.x ⁶ | 2.4.x | Scala 2.11 ⁷
Databricks | 7.x ⁶ | 3.0.1 | Scala 2.12
Databricks | 8.x ⁶ | 3.1.1 | Scala 2.12
Databricks | 9.1 | 3.1.2 | Scala 2.12
Databricks | 10.4 | 3.2.1 | Scala 2.12
Databricks | 11.3 | 3.3.0 | Scala 2.12
Databricks | 12.2 | 3.3.2 | Scala 2.12
Databricks | 13.3 | 3.4.1 | Scala 2.12
Databricks | 14.3 | 3.5.0 | Scala 2.12
Google Dataproc | 1.3 | 2.3.4 | Scala 2.11 ⁷
Google Dataproc | 1.4 | 2.4.8 | Scala 2.11 ⁷
Google Dataproc | 2.0.0 - 2.0.39 | 3.0.0 - 3.1.2 | Scala 2.12
Google Dataproc | 2.0.40 and later 2.0.x | 3.1.3 | Scala 2.12
Google Dataproc | 2.1 | 3.3.0 | Scala 2.12
Google Dataproc | 2.2 | 3.5.0 | Scala 2.12
Hadoop YARN ¹, Cloudera distribution | CDH 5.9.x and later 5.x ⁶ ², CDH 6.1.x and later 6.x ⁶, or CDP Private Cloud Base 7.1.x | 2.3.0 and later 2.x | Scala 2.11 ⁷ with Java JDK 8
Hadoop YARN ¹, Cloudera distribution | CDP Private Cloud Base 7.1.x | 3.x | Scala 2.12
Hadoop YARN ¹, Hortonworks distribution | 3.1.0.0 ⁶ | 2.3.0 and later 2.x | Scala 2.11 ⁷
Hadoop YARN ¹, MapR distribution ³ | 6.1.0 | 2.3.0 and later 2.x | Scala 2.11 ⁷
Hadoop YARN ¹, MapR distribution ³ | 7.0 | 3.2.0 ⁴ | Scala 2.12
Microsoft SQL Server 2019 Big Data Cluster | SQL Server 2019 Cumulative Update 5 or later ⁶ | 2.3.0 or later 2.x | Scala 2.11 ⁷
Spark Standalone Cluster ⁵ | NA | NA | Any

¹ Before using a Hadoop YARN cluster, complete all required tasks.

² If using CDH 5.x.x, you must first install CDS Powered by Apache Spark version 2.3 Release 3 or higher on the cluster.

³ Before using MapR, complete the prerequisite tasks.
Note: MapR is now HPE Ezmeral Data Fabric. This documentation uses "MapR" to refer to both MapR and HPE Ezmeral Data Fabric.

⁴ The MapR 7.0 distribution requires Ezmeral Ecosystem Pack (EEP) 8.1.0, which includes Spark 3.2.0.

⁵ Spark Standalone clusters are supported for development workloads only.

⁶ These clusters have been deprecated and are no longer tested with Transformer.

⁷ Support for Spark 2.x and Transformer prebuilt with Scala 2.11 has been deprecated.

Scala, Spark, and Java JDK Requirements

Transformer requires that the appropriate versions of Scala, Spark, and Java JDK are installed.

Note: Support for Spark 2.x and Transformer prebuilt with Scala 2.11 has been deprecated.
Scala match requirement
To run cluster pipelines, the Scala version on the clusters must match the Scala version prebuilt in Transformer. If you install Spark on the Transformer machine, the Scala version prebuilt in the Spark installation must also match the Scala version prebuilt in Transformer.

You can view the Scala version prebuilt with Transformer by clicking the Help icon in the upper right corner of the user interface, then clicking About.

Spark requirement
The Spark version that you install on clusters or the Transformer machine depends on the Transformer installation that you use:
  • For Transformer prebuilt with Scala 2.11, install Spark 2.x prebuilt with Scala 2.11.

    In general, most Spark 2.x installation packages are prebuilt with Scala 2.11. However, most Spark 2.4.2 installation packages are prebuilt with Scala 2.12.x instead.

  • For Transformer prebuilt with Scala 2.12, install Spark 3.x, which is prebuilt with Scala 2.12.
Java JDK requirement
The Java Development Kit (JDK) version that you must install on the Transformer machine depends on the Transformer installation package that you use:
  • Scala 2.11 - Requires Java JDK 8.
  • Scala 2.12 - Requires Java JDK 11.
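
As a quick check on the Transformer machine, the following commands report the installed JDK version and, if Spark is installed locally, the Spark version along with the Scala version it was prebuilt with. This assumes that java and spark-submit are on the PATH:

# JDK installed on the Transformer machine
java -version

# Spark version and the Scala version it was prebuilt with
spark-submit --version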

Spark Versions and Available Features

The Spark version on a cluster determines the Transformer features that you can use in pipelines that the cluster runs. The Spark version that you install on the Transformer machine determines the features that you can use in local and standalone pipelines.

Transformer does not need a local Spark installation to run cluster pipelines. However, Transformer does require a local Spark installation to perform certain tasks, such as using embedded Spark libraries to preview or validate pipelines, and starting pipelines in client deployment mode on Hadoop YARN clusters.

Important: Support for Spark 2.x and Transformer prebuilt with Scala 2.11 has been deprecated. Also, StreamSets does not provide support for Spark installations.
The following table describes the features available with different Spark versions:
Spark Version | Features
Apache Spark 2.3.x | Provides access to all Transformer features, except those listed below.
Apache Spark 2.4.0 and later | Provides access to the following additional features:
  • JDBC destination: Using the Write to a Slowly Changing Dimension write mode
  • JDBC Query origin: Reading from most database vendors
  • JSON Parser processor: Inferring the JSON schema from incoming data
  • Kafka origin and destination: Processing Avro data and Kafka message keys
  • Snowflake origin: Pushdown optimization
  • XML Parser processor: Inferring the XML schema from incoming data
Apache Spark 2.4.2 and later | Provides access to the following additional features:
  • Delta Lake stages
  • Hive destination: Using partitioned columns
Apache Spark 2.4.4 and later | Provides access to the following additional feature:
  • JDBC Query origin: Reading from Oracle databases
Apache Spark 3.0.0 and later | Provides access to the following additional feature:
  • JSON Parser processor: Error handling mode

When you use Spark 3.0.0 or later, the following features are not available at this time:
  • Elasticsearch stages

When you use Spark 3.2.x, the following feature is not available at this time:
  • Azure SQL destination

Spark Shuffle Service Requirement

To run a pipeline on a Spark cluster, Transformer requires that the Spark external shuffle service be enabled on the cluster.

Most Spark clusters have the external shuffle service enabled by default. However, Hortonworks clusters do not.

Before you run a pipeline on a Hortonworks cluster, enable the Spark external shuffle service on the cluster. Enable the shuffle service on other clusters as needed.

When you run a pipeline on a cluster without the Spark external shuffle service enabled, the following error is written to the Spark log:
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled. 

For more information about enabling the Spark external shuffle service, see the Spark documentation.
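
As a rough sketch, on a cluster where you manage Spark configuration directly, the relevant spark-defaults.conf properties look like the following. Managed platforms such as EMR, Databricks, and Dataproc typically expose these settings through their own configuration mechanisms, and Spark on YARN also requires registering the shuffle service as a NodeManager auxiliary service, as described in the Spark documentation:

# spark-defaults.conf (cluster side)
spark.shuffle.service.enabled     true
spark.dynamicAllocation.enabled   true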

Configure the Open File Limit

Transformer requires a large number of file descriptors to work correctly with all stages. Most operating systems provide a configuration to limit the number of files a process or a user can open. The default values are usually less than the Transformer requirement of 32768 file descriptors.

Use the following command to verify the configured limit for the current user:
ulimit -n

Most operating systems use two ways of configuring the maximum number of open files - the soft limit and the hard limit. The hard limit is set by the system administrator. The soft limit can be set by the user, but only up to the hard limit.
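
For example, you can check the soft and hard limits separately:

ulimit -Sn    # soft limit for open file descriptors
ulimit -Hn    # hard limit for open file descriptors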

Increasing the open file limit differs for each operating system. Consult your operating system documentation for the preferred method.

Increase the Limit on Linux

To increase the open file limit on Linux, see the following solution: How to set ulimit values.

This solution should work on Red Hat Enterprise Linux, Oracle Linux, CentOS, and Ubuntu. However, refer to the administrator documentation for your operating system for the preferred method.
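
As a sketch of the approach described in that solution, you can typically raise the limits by adding entries to /etc/security/limits.conf (or a file under /etc/security/limits.d/) and then starting a new login session. The user name tx below is only an example for the account that runs Transformer:

# /etc/security/limits.conf
tx    soft    nofile    32768
tx    hard    nofile    32768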

Increase the Limit on Mac OS

The method you use to increase the limit on Mac OS can differ with each version. Refer to the documentation for your operating system version for the preferred method.

To increase the limit for the entire computer so that it is retained after relaunching the terminal or restarting the computer, create a property list file. The following steps are valid for Mac OS Yosemite, El Capitan, and Sierra:

  1. Use the following command to create a property list file named limit.maxfiles.plist:
    sudo vim /Library/LaunchDaemons/limit.maxfiles.plist
  2. Add the following contents to the file, modifying the maxfiles attribute as needed.

    The maxfiles attribute defines the open file limit. The first value in the file is the soft limit. The second value is the hard limit.

    For example, in the following limit.maxfiles.plist file, both the soft and hard limit are set to 32,768:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
      <dict>
        <key>Label</key>
        <string>limit.maxfiles</string>
        <key>ProgramArguments</key>
        <array>
          <string>launchctl</string>
          <string>limit</string>
          <string>maxfiles</string>
          <string>32768</string>
          <string>32768</string>
        </array>
        <key>RunAtLoad</key>
        <true/>
        <key>ServiceIPC</key>
        <false/>
      </dict>
    </plist>
  3. Use the following commands to load the new settings:
    sudo launchctl unload -w /Library/LaunchDaemons/limit.maxfiles.plist
    sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
  4. Use the following command to check that the system limits were modified:
    launchctl limit maxfiles
  5. Use the following command to set the session limit:
    ulimit -n 32768

Default Port

Transformer uses either HTTP or HTTPS, both of which run over TCP. Configure network routes and firewalls so that web browsers and the Spark cluster can reach the Transformer IP address.

For example, if your Transformer is installed on EC2 and you run pipelines on EMR, make sure that the EMR cluster can access Transformer on EC2.

The default port number that Transformer uses depends on the configured protocol:
  • HTTP - Default is 19630.
  • HTTPS - Default depends on the configuration. For more information, see Enabling HTTPS.
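
For a quick reachability check from a cluster node, you can request the Transformer URL over the configured port. The host name below is a placeholder, and 19630 assumes the default HTTP port:

curl -sI http://transformer.example.com:19630 | head -n 1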

Browser Requirements

Use the latest version of one of the following browsers to access the Transformer UI:
  • Chrome
  • Firefox
  • Safari

Docker Image Requirement

For a Docker image installation of Transformer, the machine must also have Docker installed.

Hadoop YARN Requirements

Before using a Hadoop YARN cluster, complete the following requirements:

Directories

When using a Hadoop YARN cluster, the following directories must exist:
Spark node local directories
The yarn.nodemanager.local-dirs configuration property in the yarn-site.xml file defines one or more directories that must exist on each Spark node.
The value of the configuration parameter should be available in the cluster manager user interface. By default, the property is set to ${hadoop.tmp.dir}/nm-local-dir.
The specified directories must meet the following requirements on each node of the cluster:
  • Exist on the node.
  • Be owned by YARN.
  • Have read permission granted to the Transformer proxy user.
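To confirm which directories are configured, you can look up the property in yarn-site.xml on a cluster node. The path below is a common default that can differ by distribution:

grep -A 1 'yarn.nodemanager.local-dirs' /etc/hadoop/conf/yarn-site.xml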
HDFS application resource directories
Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as follows:
/user/<Transformer proxy user name>
Ensure that both of the following requirements are met:
  • Each resource directory exists on HDFS.
  • Each Transformer proxy user has read and write permission on their resource directory.
For example, you might use the following command to add a Transformer user, tx, to a spark user group:
usermod -aG spark tx
Then, you can use the following commands to create the /user/tx directory and ensure that the spark user group has the correct permissions to access the directory:
sudo -u hdfs hdfs dfs -mkdir /user/tx
sudo -u hdfs hdfs dfs -chown tx:spark /user/tx
sudo -u hdfs hdfs dfs -chmod -R 775 /user/tx
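
To confirm the resulting ownership and permissions, you can list the directory itself, reusing the example tx user:

sudo -u hdfs hdfs dfs -ls -d /user/tx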

JDBC Driver

When you run pipelines on older distributions of Hadoop YARN clusters, the cluster can have an older JDBC driver on the classpath that takes precedence over the JDBC driver required for the pipeline. This can be a problem for PostgreSQL and SQL Server JDBC drivers.

When a pipeline encounters this issue, it generates a SQLFeatureNotSupportedException error, such as:

java.sql.SQLFeatureNotSupportedException: This operation is not supported.

To avoid this issue, update the PostgreSQL and SQL Server JDBC drivers on the cluster to the latest available versions.
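
To see which driver versions are already present on a cluster node before updating them, you might search the typical installation directories. The paths and jar name patterns below are assumptions that vary by distribution:

find /usr /opt -name 'postgresql-*.jar' -o -name 'mssql-jdbc-*.jar' 2>/dev/null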

Memory

When you run pipelines on a Hadoop YARN cluster, the Spark submit process continues to run until the pipeline finishes, which consumes memory on the Transformer machine. When the Transformer machine has limited memory or a large number of pipelines start on a single Transformer, this memory usage can cause pipelines to remain indefinitely in a running or stopping status.

To avoid this issue, run the following command on each Transformer machine to cap the Java heap size available to each Spark submit process at 64 MB:
export SPARK_SUBMIT_OPTS="-Xmx64m"