What is StreamSets Transformer?
StreamSets Transformer™ is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, they can perform transformations that require heavy processing on the entire data set, in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.
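The difference between the two modes can be illustrated with a minimal sketch in plain Python. It is not Transformer or Spark code: `InMemoryOrigin`, `run_batch`, and `run_streaming` are hypothetical names invented for this illustration, and the `stop_requested` callback stands in for a user manually stopping the pipeline.

```python
import time
from typing import Callable, List, Sequence

Batch = List[dict]


class InMemoryOrigin:
    """Hypothetical stand-in for an origin system.

    Each inner list represents the records that arrive during one interval.
    """

    def __init__(self, arrivals: Sequence[Batch]) -> None:
        self._arrivals = [list(chunk) for chunk in arrivals]
        self._cursor = 0

    def read_available(self) -> Batch:
        """Return every record not yet read and mark all of them as read."""
        pending = [rec for chunk in self._arrivals[self._cursor:] for rec in chunk]
        self._cursor = len(self._arrivals)
        return pending

    def read_next_interval(self) -> Batch:
        """Return the records of the next interval, or an empty batch if none remain."""
        if self._cursor >= len(self._arrivals):
            return []
        chunk = self._arrivals[self._cursor]
        self._cursor += 1
        return chunk

    def drained(self) -> bool:
        """True once every interval has been read."""
        return self._cursor >= len(self._arrivals)


def run_batch(origin: InMemoryOrigin, process: Callable[[Batch], object]) -> object:
    """Batch mode: process all available data as one batch, then stop."""
    return process(origin.read_available())


def run_streaming(
    origin: InMemoryOrigin,
    process: Callable[[Batch], object],
    stop_requested: Callable[[], bool],
    interval_seconds: float = 0.0,
) -> List[object]:
    """Streaming mode: process one batch per interval until a stop is requested."""
    results = []
    while not stop_requested():
        results.append(process(origin.read_next_interval()))
        time.sleep(interval_seconds)  # the user-defined interval between batches
    return results
```

With the same three intervals of data, `run_batch` calls `process` once on all records, while `run_streaming` calls it once per interval and keeps going until `stop_requested()` returns true.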
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar, and how they differ.
Tutorials and Sample Pipelines
StreamSets provides tutorials and sample pipelines to help you learn about using Transformer.
Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output records in the order that matches the input records or in the order produced by the processor.
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different output streams.
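The effect of editing a routing condition can be sketched in plain Python. This is an illustration only, not the Stream Selector implementation: `route_records` is a hypothetical helper, and the sketch assumes that a record goes to every stream whose condition matches and to a `default` stream when none match.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def route_records(
    records: List[Record], conditions: Dict[str, Condition]
) -> Dict[str, List[Record]]:
    """Route each record to every stream whose condition holds.

    Records that match no condition go to the 'default' stream (an assumption
    made for this sketch).
    """
    streams: Dict[str, List[Record]] = {name: [] for name in conditions}
    streams["default"] = []
    for record in records:
        matched = False
        for name, condition in conditions.items():
            if condition(record):
                streams[name].append(record)
                matched = True
        if not matched:
            streams["default"].append(record)
    return streams


# Hypothetical preview data.
preview_records = [
    {"id": 1, "amount": 30},
    {"id": 2, "amount": 75},
    {"id": 3, "amount": 150},
]

# Original condition, then an edited one with a lower threshold.
before = route_records(preview_records, {"large": lambda r: r["amount"] > 100})
after = route_records(preview_records, {"large": lambda r: r["amount"] > 50})
```

Comparing `before` and `after` mirrors what preview shows after an edit: lowering the threshold moves the record with `id` 2 from the `default` stream to the `large` stream.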
Overview
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Cluster and Spark URLs
In monitor mode, the Monitoring panel provides URLs for the cluster or the Spark application that runs the pipeline.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either the Summary or History tab.
Viewing Transformer Directories
You can view the directories that Transformer uses. You might check which directories are in use when you need to access a file in one of them or to increase the amount of available space for a directory.
Shutting Down Transformer
You can shut down and then manually launch Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins.
Restarting Transformer
You can restart Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins. During the restart process, Transformer shuts down and then automatically restarts.
Opting Out of Usage Statistics Collection
You can help improve Transformer by allowing StreamSets to collect usage statistics about Transformer system performance and the features that you use. This information helps StreamSets improve product performance and make product development decisions.