StreamSets Platform - Transformer Engine Guide

Transformer functions as a Spark client that launches distributed Spark applications.

Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.

Streaming Case Study

Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.
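The difference between the two modes can be sketched in plain Python. This is only an illustration of the control flow described above, not Transformer or Spark code: `run_batch`, `run_streaming`, `poll`, and `stop_requested` are hypothetical names, and the short interval stands in for a user-defined one.

```python
import time


def run_batch(source, process):
    """Batch mode: process all available data as one batch, then stop."""
    batch = list(source)
    process(batch)


def run_streaming(poll, process, interval_seconds, stop_requested):
    """Streaming mode: keep polling the origin at a fixed interval
    until a stop is requested (hypothetical stand-in for a manual stop)."""
    while not stop_requested():
        batch = poll()
        if batch:  # skip intervals where no new data arrived
            process(batch)
        time.sleep(interval_seconds)


# Batch run: a single batch containing everything available.
batch_out = []
run_batch([1, 2, 3], batch_out.append)

# Streaming run: data arrives over several intervals.
arrivals = [[1, 2], [], [3]]
stream_out = []


def poll():
    return arrivals.pop(0) if arrivals else []


def stop_requested():
    # Assumption for the demo: stop once the simulated origin is drained.
    return not arrivals


run_streaming(poll, stream_out.append, 0.01, stop_requested)
```

In the sketch, the batch run produces one batch and returns, while the streaming run produces one batch per interval in which data arrived.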

Tutorials and Sample Pipelines

StreamSets provides tutorials and sample pipelines to help you learn how to use Transformer.

Installation Requirements for Self-Managed Deployments

General Installation Requirements

Granting the Spark Cluster Access to Transformer

Protecting Sensitive Data in Configuration Properties

Java and Security Configuration

Enabling HTTPS

Using a Proxy Server

Credential Stores

Stage-Related Prerequisites

Amazon EMR Serverless

Cloudera Data Engineering

Databricks

Google Dataproc

Hadoop YARN

SQL Server 2019 Big Data Cluster

Pipeline Design

What is a Transformer Pipeline?

A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data along the way.

Sample Pipelines

Transformer provides sample pipelines that you can use to learn about Transformer features or as templates for building your own pipelines.

Stage Library Match Requirement

Local Pipelines

Typically, you run a Transformer pipeline on a cluster. You can also run a pipeline on a Spark installation on the Transformer machine. This is known as a local pipeline.

Spark Executors

A Transformer pipeline runs on one or more Spark executors.

Partitioning

When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
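As a rough illustration of the idea, the following plain-Python sketch splits data into partitions and applies the same operation to each partition in parallel. It does not use Spark. `split_into_partitions` and `transform_partition` are hypothetical helpers, and the round-robin split is an assumption made for the example, not the way Spark assigns partitions.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin across partitions (illustrative only)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def transform_partition(partition):
    """The operation applied independently to each partition."""
    return [value * 2 for value in partition]


rows = list(range(10))
partitions = split_into_partitions(rows, 3)

# Each partition is processed by its own worker, in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform_partition, partitions))

# Combine the per-partition results.
combined = sorted(value for part in results for value in part)
```

Because each partition is transformed independently, adding workers lets more partitions be processed at the same time.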

Offset Handling

Batch Header Attributes

Batch headers carry attributes that you can use in pipeline logic.

Delivery Guarantee

Transformer's offset handling ensures that a pipeline does not lose data when a sudden failure occurs: all data is processed at least once. Depending on when the failure occurs, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
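A minimal sketch of how this kind of guarantee can arise, assuming the offset is saved only after a batch has been written. This is a simplified model for illustration, not Transformer's actual offset implementation. `run_pipeline`, `offset_store`, and `fail_after_write_of_batch` are hypothetical names.

```python
def run_pipeline(records, batch_size, offset_store, sink,
                 fail_after_write_of_batch=None):
    """Process records in batches, saving the offset after each write."""
    start = offset_store.get("offset", 0)  # resume from the last saved offset
    batch_number = 0
    while start < len(records):
        batch = records[start:start + batch_size]
        sink.extend(batch)  # write the batch to the destination
        if batch_number == fail_after_write_of_batch:
            # Simulated crash: the batch was written, but its offset was not saved.
            raise RuntimeError("simulated failure before offset commit")
        start += len(batch)
        offset_store["offset"] = start  # commit the offset
        batch_number += 1


records = list(range(6))
offsets = {}
sink = []

try:
    run_pipeline(records, 2, offsets, sink, fail_after_write_of_batch=1)
except RuntimeError:
    pass  # the pipeline stopped unexpectedly

# Restart: processing resumes from the last committed offset.
run_pipeline(records, 2, offsets, sink)

duplicates = len(sink) - len(set(sink))
```

After the restart, every record is present in the sink, and the only duplicates come from the one batch that was written before the failure but not yet committed.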

Caching Data

You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.
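The following plain-Python sketch shows why caching can help when one stage feeds two downstream stages. It assumes that, without caching, the upstream stage is recomputed for each downstream consumer, which is how uncached Spark lineage typically behaves. The `Stage` class and its attributes are hypothetical and exist only for this example.

```python
class Stage:
    """A stage whose output may be cached after the first computation."""

    def __init__(self, compute, cache_enabled):
        self.compute = compute
        self.cache_enabled = cache_enabled
        self.runs = 0          # how many times the stage actually computed
        self._cached = None

    def output(self):
        if self.cache_enabled and self._cached is not None:
            return self._cached  # reuse the earlier result
        self.runs += 1
        result = self.compute()
        if self.cache_enabled:
            self._cached = result
        return result


def run_two_downstream_stages(stage):
    """Two downstream stages each request the upstream output."""
    total = sum(stage.output())
    maximum = max(stage.output())
    return total, maximum


uncached = Stage(lambda: [1, 2, 3], cache_enabled=False)
cached = Stage(lambda: [1, 2, 3], cache_enabled=True)

uncached_result = run_two_downstream_stages(uncached)
cached_result = run_two_downstream_stages(cached)
```

Both runs return the same results, but the uncached stage computes twice while the cached stage computes once.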

Performing Lookups

Expressions in Pipeline and Stage Properties

Data Types

Deprecated Functionality

Pipeline Configuration

Execution Mode

Cluster Callback URL

Preprocessing Script

Extra Spark Configuration

Ludicrous Processing Mode