Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then
stops.
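For intuition, a batch pipeline behaves much like a plain Spark batch job: read everything that is currently available, transform it, write it out, and stop. A minimal sketch in Spark, where the paths and column names are illustrative rather than Transformer's API:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("batch-sketch").getOrCreate()

// Read all available data in one pass: the batch "origin".
val orders = spark.read.parquet("/data/orders")  // hypothetical path

// Transform the data along the way.
val totals = orders.groupBy("customerId").sum("amount")

// Write the result to the "destination", then stop.
totals.write.mode("overwrite").parquet("/data/order-totals")  // hypothetical path
spark.stop()
```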
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes
data at user-defined intervals. The pipeline runs continuously until you manually stop it.
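In Spark terms, this corresponds to a Structured Streaming query whose trigger interval plays the role of the user-defined processing interval. A rough analogy, with hypothetical paths, schema, and interval:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

// Maintain a connection to the origin and keep picking up new files.
val events = spark.readStream
  .schema("id LONG, payload STRING")  // streaming file sources need an explicit schema
  .json("/data/incoming")             // hypothetical path

// Process newly arrived data every 30 seconds until the query is stopped.
val query = events.writeStream
  .format("parquet")
  .option("path", "/data/processed")           // hypothetical path
  .option("checkpointLocation", "/data/ckpt")  // hypothetical path
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()

query.awaitTermination()  // runs continuously until query.stop() is called
```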
Tutorials and Sample Pipelines
StreamSets provides tutorials and sample pipelines to help you learn about using Transformer.
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform
the data along the way.
Sample Pipelines
Transformer provides sample pipelines that you can use to learn about Transformer features or as a template for building your own pipelines.
Local Pipelines
Typically, you run a Transformer pipeline on a cluster. You can also run a pipeline on a Spark installation on the Transformer machine. This is known as a local pipeline.
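In Spark terms, a local pipeline is roughly what you get with a local master instead of a cluster manager: the whole application runs in a single JVM on the Transformer machine. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// local[*] runs Spark in one JVM on this machine,
// with one worker thread per available CPU core.
val spark = SparkSession.builder
  .appName("local-pipeline-sketch")
  .master("local[*]")
  .getOrCreate()
```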
Spark Executors
A Transformer pipeline runs on one or more Spark executors.
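The number and size of those executors is ordinary Spark configuration. For example, the following sketch requests a fixed executor count through standard Spark properties; the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("executor-sketch")
  .config("spark.executor.instances", "4")  // four executors
  .config("spark.executor.cores", "2")      // two task slots per executor
  .config("spark.executor.memory", "4g")    // memory per executor
  .getOrCreate()
```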
Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline
data into partitions and performing operations on the partitions in parallel.
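You can observe this partitioning directly in Spark; the data and partition count below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
import spark.implicits._

val df = (1 to 1000000).toDF("n")

// Spark splits the data into partitions...
println(df.rdd.getNumPartitions)

// ...and runs operations on the partitions in parallel, one task per partition.
val squared = df.selectExpr("n * n AS squared")

// Repartitioning changes the degree of parallelism explicitly.
val wider = df.repartition(16)
println(wider.rdd.getNumPartitions)  // 16
```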
Batch Header Attributes
Batch header attributes are attributes attached to the header of each batch. You can use them in pipeline logic.
Delivery Guarantee
Transformer's offset handling ensures that a pipeline does not lose data in the event of a sudden failure: it processes data at least once. If a sudden failure occurs, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
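The practical consequence is that a destination can receive the same batch twice after a failure, so writes should be idempotent or deduplicated. One common defense, sketched here under the assumption that each record carries a unique key (the eventId column is hypothetical):

```scala
import org.apache.spark.sql.DataFrame

// With at-least-once delivery, a replayed batch can contain records the
// destination has already received. If each record carries a unique key,
// dropping duplicates on that key makes reprocessing harmless.
def dedupe(batch: DataFrame): DataFrame =
  batch.dropDuplicates("eventId")  // "eventId" is a hypothetical unique key
```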
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to
more than one downstream stage.
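The Spark analogue is persisting a DataFrame: when one result feeds two downstream computations, caching computes the upstream stages once instead of twice. A sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

val cleaned = spark.read.parquet("/data/events")  // hypothetical path
  .filter("status = 'valid'")
  .persist(StorageLevel.MEMORY_AND_DISK)  // computed once, read by both consumers

// Two downstream consumers of the same data.
val byUser = cleaned.groupBy("userId").count()
val byDay  = cleaned.groupBy("eventDate").count()

byUser.write.mode("overwrite").parquet("/out/by-user")  // hypothetical path
byDay.write.mode("overwrite").parquet("/out/by-day")    // hypothetical path

cleaned.unpersist()  // release the cached data when done
```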
Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data and uses additional codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output
records in the order that matches the input records or in the order produced by the processor.
Input and Output Schema for Stages
After running preview for a pipeline, you can view the input and output schema for each stage on the Schema tab in
the pipeline properties panel. The schema includes each field path and data type.
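The schema shown mirrors Spark's own view of the data. For intuition, printSchema produces the same field paths and data types in code; the sample data here is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("schema-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a", 9.99), (2L, "b", 5.00)).toDF("id", "label", "amount")

// Prints each field path and its data type, much like the Schema tab.
df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- label: string (nullable = true)
//  |-- amount: double (nullable = false)
```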
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might
edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different
output streams.
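Conceptually, a Stream Selector condition is a predicate that routes each record to one output stream or another. In plain Spark you could emulate the split with complementary filters; the column name and threshold below are hypothetical:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Records matching the condition go to one output stream,
// everything else to the default stream. Editing the condition
// changes which records land in each output.
def split(records: DataFrame): (DataFrame, DataFrame) = {
  val condition = col("amount") > 100  // hypothetical condition
  (records.filter(condition), records.filter(!condition))
}
```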