Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then
stops.
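For intuition, a batch pipeline behaves much like a plain Spark batch job: read everything that is currently available, transform it, write it out, and stop. A minimal sketch in Spark, where the paths and column names are illustrative rather than Transformer's API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-sketch").getOrCreate()

// Read all available data in one pass: the batch "origin".
val orders = spark.read.parquet("/data/orders")  // hypothetical path

// Transform the data along the way.
val totals = orders.groupBy("customerId").sum("amount")

// Write the result to the "destination", then stop.
totals.write.mode("overwrite").parquet("/data/order-totals")  // hypothetical path
spark.stop()
```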
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes
data at user-defined intervals. The pipeline runs continuously until you manually stop it.
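In Spark terms, this corresponds to a Structured Streaming query whose trigger interval plays the role of the user-defined processing interval. A rough analogy, with hypothetical paths, schema, and interval:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

// Maintain a connection to the origin and keep picking up new files.
val events = spark.readStream
  .schema("id LONG, payload STRING")  // streaming file sources need an explicit schema
  .json("/data/incoming")             // hypothetical path

// Process newly arrived data every 30 seconds until the query is stopped.
val query = events.writeStream
  .format("parquet")
  .option("path", "/data/processed")           // hypothetical path
  .option("checkpointLocation", "/data/ckpt")  // hypothetical path
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()  // runs continuously until query.stop() is called
```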
Tutorials and Sample Pipelines
StreamSets provides tutorials and sample pipelines to help you learn about using Transformer.
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform
the data along the way.
Sample Pipelines
Transformer provides sample pipelines that you can use to learn about Transformer features or as a template for building your own pipelines.
Local Pipelines
Typically, you run a Transformer pipeline on a cluster. You can also run a pipeline on a Spark installation on the Transformer machine. This is known as a local pipeline.
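In Spark terms, a local pipeline is roughly what you get with a local master instead of a cluster manager: the whole application runs in a single JVM on the Transformer machine. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// local[*] runs Spark in one JVM on this machine,
// with one worker thread per available CPU core.
val spark = SparkSession.builder
  .appName("local-pipeline-sketch")
  .master("local[*]")
  .getOrCreate()
```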
Spark Executors
A Transformer pipeline runs on one or more Spark executors.
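The number and size of those executors is ordinary Spark configuration. For example, the following sketch requests a fixed executor count through standard Spark properties; the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("executor-sketch")
  .config("spark.executor.instances", "4")  // four executors
  .config("spark.executor.cores", "2")      // two task slots per executor
  .config("spark.executor.memory", "4g")    // memory per executor
  .getOrCreate()
```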
Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline
data into partitions and performing operations on the partitions in parallel.
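You can observe this partitioning directly in Spark; the data and partition count below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
import spark.implicits._

val df = (1 to 1000000).toDF("n")

// Spark splits the data into partitions...
println(df.rdd.getNumPartitions)

// ...and runs operations on the partitions in parallel, one task per partition.
val squared = df.selectExpr("n * n AS squared")

// Repartitioning changes the degree of parallelism explicitly.
val wider = df.repartition(16)
println(wider.rdd.getNumPartitions)  // 16
```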
Batch Header Attributes
Batch header attributes are attributes attached to the header of each batch. You can use them in pipeline logic.
Delivery Guarantee
Transformer's offset handling ensures that a pipeline does not lose data in the event of a sudden failure: it processes data at least once. If a sudden failure occurs, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
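The practical consequence is that a destination can receive the same batch twice after a failure, so writes should be idempotent or deduplicated. One common defense, sketched here under the assumption that each record carries a unique key (the eventId column is hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// With at-least-once delivery, a replayed batch can contain records the
// destination has already received. If each record carries a unique key,
// dropping duplicates on that key makes reprocessing harmless.
def dedupe(batch: DataFrame): DataFrame =
  batch.dropDuplicates("eventId")  // "eventId" is a hypothetical unique key
```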
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to
more than one downstream stage.
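The Spark analogue is persisting a DataFrame: when one result feeds two downstream computations, caching computes the upstream stages once instead of twice. A sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

val cleaned = spark.read.parquet("/data/events")  // hypothetical path
  .filter("status = 'valid'")
  .persist(StorageLevel.MEMORY_AND_DISK)  // computed once, read by both consumers

// Two downstream consumers of the same data.
val byUser = cleaned.groupBy("userId").count()
val byDay  = cleaned.groupBy("eventDate").count()

byUser.write.mode("overwrite").parquet("/out/by-user")  // hypothetical path
byDay.write.mode("overwrite").parquet("/out/by-day")    // hypothetical path

cleaned.unpersist()  // release the cached data when done
```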
Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data and uses additional codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output
records in the order that matches the input records or in the order produced by the processor.
Input and Output Schema for Stages
After running preview for a pipeline, you can view the input and output schema for each stage on the Schema tab in
the pipeline properties panel. The schema includes each field path and data type.
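The schema shown mirrors Spark's own view of the data. For intuition, printSchema produces the same field paths and data types in code; the sample data here is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("schema-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a", 9.99), (2L, "b", 5.00)).toDF("id", "label", "amount")

// Prints each field path and its data type, much like the Schema tab.
df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- label: string (nullable = true)
//  |-- amount: double (nullable = false)
```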
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might
edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different
output streams.
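Conceptually, a Stream Selector condition is a predicate that routes each record to one output stream or another. In plain Spark you could emulate the split with complementary filters; the column name and threshold below are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Records matching the condition go to one output stream,
// everything else to the default stream. Editing the condition
// changes which records land in each output.
def split(records: DataFrame): (DataFrame, DataFrame) = {
  val condition = col("amount") > 100  // hypothetical condition
  (records.filter(condition), records.filter(!condition))
}
```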