What is StreamSets Transformer?

StreamSets Transformer is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, they can perform transformations that require heavy processing across the entire data set, in either batch or streaming mode.
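To make "heavy processing on the entire data set" concrete, here is a minimal sketch of a hand-written Spark batch job in Scala. The paths and column names are hypothetical and this is not code that Transformer generates; it simply shows the kind of whole-data-set work, such as deduplication and ranking, that runs on the cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object WholeDataSetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("whole-data-set-example").getOrCreate()

    // Hypothetical input: one row per order, possibly with duplicates.
    val orders = spark.read.parquet("/data/orders")

    // Deduplication and ranking need visibility of the entire data set,
    // which is why this kind of work runs on a Spark cluster.
    val ranked = orders
      .dropDuplicates("order_id")
      .withColumn("rank",
        rank().over(Window.partitionBy("customer_id").orderBy(col("amount").desc)))

    // Write the transformed data set back out.
    ranked.write.mode("overwrite").parquet("/data/orders_ranked")

    spark.stop()
  }
}
```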

Transformer is designed to run on a wide range of clusters, including Hadoop distributions and Databricks. For the full list of supported clusters, see the Cluster Compatibility Matrix.

You install Transformer on a machine that is configured to submit Spark jobs to a cluster, such as a Hadoop edge node, a Hadoop data node, or a cloud virtual machine. You then register Transformer with StreamSets Control Hub.

Transformer pipelines read data from one or more origins, transform the data by performing operations across the entire data set, and then write the transformed data to one or more destinations. Pipelines can run in batch or streaming mode.
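The same origin-transform-destination flow in streaming mode might look like the following hand-written Spark Structured Streaming sketch. The directory paths, schema, and window sizes are assumptions for illustration, not Transformer output:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-pipeline-sketch").getOrCreate()

    // Origin: continuously pick up JSON files that land in a directory.
    val events = spark.readStream
      .schema("event_time TIMESTAMP, user_id STRING, amount DOUBLE")
      .json("/data/incoming")

    // Transformation: windowed aggregation over the stream.
    val totals = events
      .withWatermark("event_time", "10 minutes")
      .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
      .agg(sum("amount").as("total_amount"))

    // Destination: write each micro-batch of results to Parquet.
    val query = totals.writeStream
      .outputMode("append")
      .option("checkpointLocation", "/data/checkpoints/totals")
      .format("parquet")
      .option("path", "/data/totals")
      .start()

    query.awaitTermination()
  }
}
```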

When you start a Transformer pipeline, Transformer submits the pipeline to the cluster as a Spark application. Spark handles all of the pipeline processing, including complex transformations such as joins, sorts, and aggregations. As the Spark application runs, you use the UI to monitor the pipeline's progress, view real-time statistics, and see any errors that occur.
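As a rough illustration of the kind of Spark application involved, the sketch below joins, aggregates, and sorts data across the cluster. It is plain Scala Spark code with hypothetical paths, not the application that Transformer actually submits:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinSortAggregateSketch {
  def main(args: Array[String]): Unit = {
    // On a cluster, the master and deploy mode come from spark-submit,
    // e.g. spark-submit --master yarn --deploy-mode cluster ...
    val spark = SparkSession.builder().appName("join-sort-aggregate-sketch").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")     // hypothetical paths
    val customers = spark.read.parquet("/data/customers")

    // Join, aggregate, and sort across the full data set: the kind of
    // cluster-wide work that Spark performs while a pipeline runs.
    val report = orders
      .join(customers, Seq("customer_id"))
      .groupBy("region")
      .agg(count("*").as("order_count"), sum("amount").as("revenue"))
      .orderBy(desc("revenue"))

    report.write.mode("overwrite").parquet("/data/region_report")

    spark.stop()
  }
}
```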

With Transformer, you can leverage the performance and scale that Spark offers without having to write your own Spark application using Java, Scala, or Python.