General Installation Requirements
Choose a Transformer engine version based on the clusters that you want to run pipelines on and the Transformer features that you want to use.
The Scala version that Transformer is built with determines the Java JDK version that must be installed on the Transformer machine and the Spark versions that you can use with Transformer. The Spark version that you choose determines the cluster types and the Transformer features that you can use.
For example, Amazon EMR 6.1 clusters use Spark 3.x. To run Transformer pipelines on those clusters, you use Transformer prebuilt with Scala 2.12. And since Transformer prebuilt with Scala 2.12 requires Java JDK 11, you install that JDK version on the Transformer machine.
For more information, see Cluster Compatibility Matrix, Scala, Spark, and Java JDK Requirements, and Spark Versions and Available Features.
Also note the other Transformer requirements in this section.
Choosing an Engine Version
StreamSets provides Transformer engine versions prebuilt with different versions of Scala.
You can use Transformer prebuilt with the following Scala versions:
- Scala 2.11 - Use with Spark 2.x. Requires Java JDK 8. Note: Support for Spark 2.x and Transformer prebuilt with Scala 2.11 has been deprecated.
- Scala 2.12 - Use with Spark 3.x. Requires Java JDK 11.
Cluster Compatibility Matrix
The following matrix shows the Transformer Scala version required for each supported cluster version and its underlying Spark version. Use this matrix to determine the Transformer engine version to use in your deployment. A quick way to check a cluster's Spark and Scala versions follows the footnotes below.
Cluster Type | Supported Cluster Versions | Cluster Underlying Spark Version | Transformer Scala Version |
---|---|---|---|
Amazon EMR | 5.20.0 or later 5.x | 2.4.x | Scala 2.11⁷ |
Amazon EMR | 6.1 and later 6.x | 3.x | Scala 2.12 |
Amazon EMR | 7.x | 3.x | Scala 2.12 |
Amazon EMR Serverless | 6.9.0 and later 6.x | 3.x | Scala 2.12 |
Amazon EMR Serverless | 7.x | 3.x | Scala 2.12 |
Azure HDInsight | 4.0⁶ | 2.4.x | Scala 2.11⁷ |
Cloudera Data Engineering | 1.3.x | 2.4.x | Scala 2.11⁷ |
Cloudera Data Engineering | 1.3.3 and later 1.3.x | 3.x | Scala 2.12 |
Databricks | 5.x - 6.x⁶ | 2.4.x | Scala 2.11⁷ |
Databricks | 7.x⁶ | 3.0.1 | Scala 2.12 |
Databricks | 8.x⁶ | 3.1.1 | Scala 2.12 |
Databricks | 9.1 | 3.1.2 | Scala 2.12 |
Databricks | 10.4 | 3.2.1 | Scala 2.12 |
Databricks | 11.3 | 3.3.0 | Scala 2.12 |
Databricks | 12.2 | 3.3.2 | Scala 2.12 |
Databricks | 13.3 | 3.4.1 | Scala 2.12 |
Databricks | 14.3 | 3.5.0 | Scala 2.12 |
Google Dataproc | 1.3 | 2.3.4 | Scala 2.11⁷ |
Google Dataproc | 1.4 | 2.4.8 | Scala 2.11⁷ |
Google Dataproc | 2.0.0 - 2.0.39 | 3.0.0 - 3.1.2 | Scala 2.12 |
Google Dataproc | 2.0.40 and later 2.0.x | 3.1.3 | Scala 2.12 |
Google Dataproc | 2.1 | 3.3.0 | Scala 2.12 |
Google Dataproc | 2.2 | 3.5.0 | Scala 2.12 |
Hadoop YARN¹ (Cloudera distribution) | CDH 5.9.x and later 5.x² ⁶, CDH 6.1.x and later 6.x⁶, CDP Private Cloud Base 7.1.x | 2.3.0 and later 2.x | Scala 2.11⁷ with Java JDK 8 |
Hadoop YARN¹ (Cloudera distribution) | CDP Private Cloud Base 7.1.x | 3.x | Scala 2.12 |
Hadoop YARN¹ (Hortonworks distribution) | 3.1.0.0⁶ | 2.3.0 and later 2.x | Scala 2.11⁷ |
Hadoop YARN¹ (MapR distribution³) | 6.1.0 | 2.3.0 and later 2.x | Scala 2.11⁷ |
Hadoop YARN¹ (MapR distribution³) | 7.0 | 3.2.0⁴ | Scala 2.12 |
Microsoft SQL Server 2019 Big Data Cluster | SQL Server 2019 Cumulative Update 5 or later⁶ | 2.3.0 or later 2.x | Scala 2.11⁷ |
Spark Standalone Cluster⁵ | NA | NA | Any |
¹ Before using a Hadoop YARN cluster, complete all required tasks.
² If using CDH 5.x.x, you must first install CDS Powered by Apache Spark version 2.3 Release 3 or higher on the cluster.
⁴ The MapR 7.0 distribution requires Ezmeral Ecosystem Pack (EEP) 8.1.0, which includes Spark 3.2.0.
⁵ Spark Standalone clusters are supported for development workloads only.
⁶ These clusters have been deprecated and are no longer tested with Transformer.
⁷ Support for Spark 2.x and Transformer prebuilt with Scala 2.11 has been deprecated.
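To find the matrix row that applies to you, you can check the Spark and Scala versions that a cluster actually runs. A minimal sketch, assuming shell access to a cluster gateway or edge node with Spark on the PATH:

```sh
# spark-submit prints a version banner that includes both the Spark
# version and the Scala version that the build was compiled against.
spark-submit --version
```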
Scala, Spark, and Java JDK Requirements
Transformer requires that the appropriate versions of Scala, Spark, and Java JDK are installed.
- Scala match requirement - To run cluster pipelines, the Scala version on the clusters must match the Scala version that Transformer is prebuilt with. If you install Spark on the Transformer machine, the Scala version that the Spark installation is prebuilt with must also match the Scala version prebuilt in Transformer.
- Spark requirement - The Spark version that you install on clusters or on the Transformer machine depends on the Transformer installation that you use:
  - For Transformer prebuilt with Scala 2.11, install Spark 2.x prebuilt with Scala 2.11. In general, most Spark 2.x installation packages are prebuilt with Scala 2.11. However, most Spark 2.4.2 installation packages are prebuilt with Scala 2.12 instead.
  - For Transformer prebuilt with Scala 2.12, install Spark 3.x, which is prebuilt with Scala 2.12.
- Java JDK requirement - The Java Development Kit (JDK) version that you must install on the Transformer machine depends on the Scala version associated with the Transformer engine version for the deployment:
  - Scala 2.11 - Requires Java JDK 8.
  - Scala 2.12 - Requires Java JDK 11.

A quick way to verify the versions installed on the Transformer machine appears after this list.
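A minimal verification sketch, assuming Spark is installed locally and `spark-submit` is on the PATH:

```sh
# Print the installed JDK version: Transformer prebuilt with Scala 2.11
# needs JDK 8, and Transformer prebuilt with Scala 2.12 needs JDK 11.
java -version

# Print the local Spark version and the Scala version it was built with;
# the banner includes a line such as "Using Scala version 2.12.x".
spark-submit --version
```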
Spark Versions and Available Features
The Spark version on a cluster determines the Transformer features that you can use in pipelines that the cluster runs. The Spark version that you install on the Transformer machine determines the features that you can use in local and standalone pipelines.
Transformer does not need a local Spark installation to run cluster pipelines. However, Transformer does require a local Spark installation to perform certain tasks, such as using embedded Spark libraries to preview or validate pipelines, and starting pipelines in client deployment mode on Hadoop YARN clusters.
Spark Version | Features |
---|---|
Apache Spark 2.3.x | Provides access to all Transformer features, except those noted in the rows below. |
Apache Spark 2.4.0 and later | Provides access to additional features that are unavailable with earlier Spark versions. |
Apache Spark 2.4.2 and later | Provides access to further additional features. |
Apache Spark 2.4.4 and later | Provides access to one further additional feature. |
Apache Spark 3.0.0 and later | Provides access to one further additional feature. Some features are not available at this time when you use Spark 3.0.0 or later, and one more feature is not available when you use Spark 3.2.x. |
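For local and standalone pipelines and for tasks such as preview that need a local Spark installation, Transformer typically locates Spark through the `SPARK_HOME` environment variable. Treat the variable name and the install path below as assumptions to confirm against your installation docs; a minimal sketch:

```sh
# Hypothetical install path: a local Spark 3.x build prebuilt with
# Scala 2.12, matching a Transformer engine prebuilt with Scala 2.12.
export SPARK_HOME=/opt/spark-3.3.0-bin-hadoop3

# Sanity check: print the Spark and Scala versions of this installation.
"$SPARK_HOME/bin/spark-submit" --version
```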
Spark Shuffle Service Requirement
To run a pipeline on a Spark cluster, Transformer requires that the Spark external shuffle service be enabled on the cluster.
Most Spark clusters have the external shuffle service enabled by default. However, Hortonworks clusters do not.
Before you run a pipeline on a Hortonworks cluster, enable the Spark external shuffle service. Enable the shuffle service on other clusters as needed. When the shuffle service is not enabled, pipelines fail with an error such as the following:

```
org.apache.spark.SparkException: Dynamic allocation of executors requires the external shuffle service. You may enable this through spark.shuffle.service.enabled.
```
For more information about enabling the Spark external shuffle service, see the Spark documentation.
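How you enable the service depends on the cluster manager. The property below is standard Spark configuration, but the file location is an assumption to adapt to your distribution; a minimal sketch:

```sh
# Check whether the external shuffle service is already enabled in the
# cluster's Spark configuration.
grep -i "spark.shuffle.service.enabled" "$SPARK_HOME/conf/spark-defaults.conf"

# Enable it if needed. On YARN clusters, the shuffle service also has to
# be registered as the spark_shuffle auxiliary service in yarn-site.xml
# on each NodeManager; see the Spark documentation for the full steps.
echo "spark.shuffle.service.enabled true" >> "$SPARK_HOME/conf/spark-defaults.conf"
```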
Default Port
Transformer uses either HTTP or HTTPS, both of which run over TCP. Configure network routes and firewalls so that the Spark cluster can reach the Transformer IP address and port. A simple connectivity check appears after the port list below.
For example, if your Transformer is installed on EC2 and you run pipelines on EMR, make sure that the EMR cluster can access Transformer on EC2.
- HTTP - Default is 19630.
- HTTPS - Default depends on the configuration. For more information, see Enabling HTTPS.
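One way to confirm connectivity is to request the Transformer URL from a cluster node. A minimal sketch, using a hypothetical host name and the default HTTP port:

```sh
# From a cluster node: any HTTP status code means the cluster can reach
# Transformer; a timeout suggests a firewall or routing problem.
curl -s -o /dev/null -w "%{http_code}\n" http://transformer.example.com:19630
```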
Hadoop YARN Requirements
Before using a Hadoop YARN cluster, complete the following tasks:
- Create the required directories.
- Update JDBC drivers on older distributions, as needed.
- Decrease the amount of memory available to the Spark submit process, as needed.
Directories
- Spark node local directories - The Spark `yarn.nodemanager.local-dirs` configuration parameter in the yarn-site.xml file defines one or more directories that must exist on each Spark node.
- HDFS application resource directories - Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as follows: `/user/<Transformer proxy user name>`. A sketch for creating this directory follows.
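If the home directory does not already exist, you can create it up front. A minimal sketch, using `transformer` as a hypothetical proxy user name:

```sh
# Run as an HDFS superuser: create the proxy user's home directory and
# hand it to the proxy user so Spark can store application resources there.
hdfs dfs -mkdir -p /user/transformer
hdfs dfs -chown transformer:transformer /user/transformer
```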
JDBC Driver
When you run pipelines on older distributions of Hadoop YARN clusters, the cluster can have an older JDBC driver on the classpath that takes precedence over the JDBC driver required for the pipeline. This can be a problem for PostgreSQL and SQL Server JDBC drivers.
When a pipeline encounters this issue, it generates a SQLFeatureNotSupportedException error, such as:

```
java.sql.SQLFeatureNotSupportedException: This operation is not supported.
```
To avoid this issue, update the PostgreSQL and SQL Server JDBC drivers on the cluster to the latest available versions.
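To spot an outdated driver before it causes trouble, you can look for the driver jars already present on cluster nodes. A rough sketch, with search paths and jar name patterns as assumptions:

```sh
# Search common locations on a cluster node for PostgreSQL and SQL Server
# JDBC driver jars, then compare their versions to what the pipeline needs.
find /usr /opt \( -name "postgresql-*.jar" -o -name "mssql-jdbc-*.jar" \) 2>/dev/null
```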
Memory
When you run pipelines on a Hadoop YARN cluster, the Spark submit process for each pipeline continues to run until the pipeline finishes, which uses memory on the Transformer machine. This memory usage can cause pipelines to remain in a running or stopping status indefinitely when the Transformer machine has limited memory or when a large number of pipelines start on a single Transformer.

To decrease the memory available to each Spark submit process, define the SPARK_SUBMIT_OPTS environment variable on the Transformer machine. For example, the following setting limits each Spark submit process to a 64 MB maximum heap:

```sh
export SPARK_SUBMIT_OPTS="-Xmx64m"
```
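If pipelines seem stuck in a running or stopping status, it can help to see which Spark submit processes are still alive and what heap they were given. A diagnostic sketch, not a required step:

```sh
# Each pipeline's launcher runs as a Java process with the main class
# org.apache.spark.deploy.SparkSubmit; the -Xmx value from
# SPARK_SUBMIT_OPTS appears in its command line.
ps -ef | grep "[S]parkSubmit"
```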