What is StreamSets Transformer?

StreamSets Transformer is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, they can perform transformations that require heavy processing across the entire data set, in either batch or streaming mode.
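To make "heavy processing on the entire data set" concrete, here is a minimal sketch of a hand-written Spark batch job in Scala. The paths and column names are hypothetical and this is not code that Transformer generates; it simply shows the kind of whole-data-set work, such as deduplication and ranking, that runs on the cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object WholeDataSetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("whole-data-set-example").getOrCreate()

    // Hypothetical input: one row per order, possibly with duplicates.
    val orders = spark.read.parquet("/data/orders")

    // Deduplication and ranking need visibility of the entire data set,
    // which is why this kind of work runs on a Spark cluster.
    val ranked = orders
      .dropDuplicates("order_id")
      .withColumn("rank",
        rank().over(Window.partitionBy("customer_id").orderBy(col("amount").desc)))

    // Write the transformed data set back out.
    ranked.write.mode("overwrite").parquet("/data/orders_ranked")

    spark.stop()
  }
}
```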

Transformer is designed to run on a wide range of clusters, including Hadoop distributions and Databricks. For the full list of supported clusters, see the Cluster Compatibility Matrix.

You install Transformer on a machine that is configured to submit Spark jobs to a cluster, such as a Hadoop edge node, a Hadoop data node, or a cloud virtual machine. You then register Transformer with StreamSets Control Hub.

Transformer pipelines read data from one or more origins, transform the data by performing operations across the entire data set, and then write the transformed data to one or more destinations. Pipelines can run in batch or streaming mode.
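The same origin-transform-destination flow in streaming mode might look like the following hand-written Spark Structured Streaming sketch. The directory paths, schema, and window sizes are assumptions for illustration, not Transformer output:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-pipeline-sketch").getOrCreate()

    // Origin: continuously pick up JSON files that land in a directory.
    val events = spark.readStream
      .schema("event_time TIMESTAMP, user_id STRING, amount DOUBLE")
      .json("/data/incoming")

    // Transformation: windowed aggregation over the stream.
    val totals = events
      .withWatermark("event_time", "10 minutes")
      .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
      .agg(sum("amount").as("total_amount"))

    // Destination: write each micro-batch of results to Parquet.
    val query = totals.writeStream
      .outputMode("append")
      .option("checkpointLocation", "/data/checkpoints/totals")
      .format("parquet")
      .option("path", "/data/totals")
      .start()

    query.awaitTermination()
  }
}
```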

When you start a Transformer pipeline, Transformer submits the pipeline to the cluster as a Spark application. Spark handles all of the pipeline processing, including complex transformations such as joins, sorts, and aggregations. As the Spark application runs, you use the UI to monitor the pipeline's progress, view real-time statistics, and see any errors that occur.
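As a rough illustration of the kind of Spark application involved, the sketch below joins, aggregates, and sorts data across the cluster. It is plain Scala Spark code with hypothetical paths, not the application that Transformer actually submits:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object JoinSortAggregateSketch {
  def main(args: Array[String]): Unit = {
    // On a cluster, the master and deploy mode come from spark-submit,
    // e.g. spark-submit --master yarn --deploy-mode cluster ...
    val spark = SparkSession.builder().appName("join-sort-aggregate-sketch").getOrCreate()

    val orders    = spark.read.parquet("/data/orders")     // hypothetical paths
    val customers = spark.read.parquet("/data/customers")

    // Join, aggregate, and sort across the full data set: the kind of
    // cluster-wide work that Spark performs while a pipeline runs.
    val report = orders
      .join(customers, Seq("customer_id"))
      .groupBy("region")
      .agg(count("*").as("order_count"), sum("amount").as("revenue"))
      .orderBy(desc("revenue"))

    report.write.mode("overwrite").parquet("/data/region_report")

    spark.stop()
  }
}
```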

With Transformer, you can leverage the performance and scale that Spark offers without having to write your own Spark application using Java, Scala, or Python.