Spark Versions and Available Features

The Spark version on a cluster determines the Transformer features that you can use in pipelines that the cluster runs. The Spark version that you install on the Transformer machine determines the features that you can use in local and standalone pipelines.

Transformer does not need a local Spark installation to run cluster pipelines. However, Transformer does require a local Spark installation for certain tasks, such as previewing or validating pipelines with the embedded Spark libraries and starting pipelines in client deployment mode on Hadoop YARN clusters.
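Before running local or standalone pipelines, you may want to confirm that a Spark installation is visible on the Transformer machine. The following is a minimal sketch, not part of Transformer itself: the `has_local_spark` helper is hypothetical and only checks the common locations, the `SPARK_HOME` environment variable and the `PATH`.

```python
import os
import shutil

def has_local_spark() -> bool:
    """Return True if a local Spark installation appears to be present.

    Checks SPARK_HOME for a bin/spark-submit script, then falls back
    to looking for spark-submit on the PATH. This is a heuristic only.
    """
    spark_home = os.environ.get("SPARK_HOME", "")
    if spark_home and os.path.isfile(os.path.join(spark_home, "bin", "spark-submit")):
        return True
    return shutil.which("spark-submit") is not None
```

Running `spark-submit --version` on the Transformer machine then shows which Spark version, and therefore which features, local and standalone pipelines can use.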

Important: StreamSets does not provide support for Spark installations.

The following list describes the features available with different Spark versions:

Apache Spark 2.3.x
  Provides access to all Transformer features, except those listed below.

Apache Spark 2.4.0 and later
  Provides access to the following additional features:
  • JDBC destination: Using the Write to a Slowly Changing Dimension write mode
  • JDBC Query origin: Reading from most database vendors
  • Kafka origin and destination: Processing Avro data and Kafka message keys
  • Snowflake origin: Pushdown optimization

Apache Spark 2.4.2 and later
  Provides access to the following additional features:
  • Delta Lake stages
  • Hive destination: Using partitioned columns

Apache Spark 2.4.4 and later
  Provides access to the following additional feature:
  • JDBC Query origin: Reading from Oracle databases
Apache Spark 3.0.0 and later
  The following features are not available at this time:
  • Azure SQL destination
  • Elasticsearch stages
  • Google Cloud stages
  • Kudu stages
  • MapR stages
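The minimum-version requirements above can be expressed as a simple comparison of version tuples. The sketch below is illustrative only: the feature keys and helper names are hypothetical, and it models only the minimum-version rules, not the separate Spark 3.0.0 exclusions listed last.

```python
def parse_version(version: str) -> tuple:
    # "2.4.2" -> (2, 4, 2); drop any suffix such as "-SNAPSHOT".
    core = version.split("-")[0]
    return tuple(int(part) for part in core.split("."))

# Minimum Spark versions for selected features, taken from the list above.
# Keys are illustrative names, not Transformer identifiers.
FEATURE_MIN_VERSION = {
    "jdbc_scd_write_mode": (2, 4, 0),   # JDBC destination SCD write mode
    "delta_lake_stages": (2, 4, 2),     # Delta Lake stages
    "jdbc_query_oracle": (2, 4, 4),     # JDBC Query origin with Oracle
}

def feature_available(feature: str, spark_version: str) -> bool:
    """Return True if the cluster's Spark version meets the feature's minimum."""
    return parse_version(spark_version) >= FEATURE_MIN_VERSION[feature]
```

For example, `feature_available("delta_lake_stages", "2.4.2")` is true, while the same check against Spark 2.4.0 is false, matching the thresholds above.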