Installation Overview
To set up and deploy a Data Collector engine in your corporate network, you create environments and deployments in Control Hub.
Requirements for Self-Managed Deployments
Stage Libraries
Supported Systems and Versions
MapR Prerequisites
Retrying the Pipeline
Rate Limit
Advanced Options
Pipelines and most pipeline stages include advanced options with default values that should work in most cases. By default, each pipeline and stage hides its advanced options, which can include individual properties or complete tabs.
Simple and Bulk Edit Mode
Runtime Values
Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
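For example, as a hedged sketch, a stage's directory property could reference a runtime property with the runtime:conf function, so the value can change between pipeline runs without editing the pipeline (the property name spooldir.path is hypothetical):

    ${runtime:conf('spooldir.path')}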
Event Generation
Webhooks
Notifications
SSL/TLS Encryption
Many stages can use SSL/TLS encryption to connect securely to external systems.
Security in Amazon Stages
Security in Google Cloud Stages
Security in Kafka Stages
Kafka Message Keys
Authentication in Salesforce Stages
Implicit and Explicit Validation
Expression Configuration
Configuring a Pipeline
Data Formats Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
When you use an origin to read log data, you define the format of the log files to be read.
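For example, an origin configured for the Common Log Format would parse an entry like the following illustrative line into fields such as the client address, timestamp, request, and status code:

    127.0.0.1 - frank [10/Oct/2022:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326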
NetFlow Data Processing
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
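As a rough sketch, a whole file record does not contain the parsed file contents; it typically carries a reference to the file plus a map of file metadata, along these lines:

    /fileRef  - reference used to stream the file contents to the destination
    /fileInfo - map of metadata such as the file name, path, and size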
Reading and Processing XML Data
Writing XML Data
Solutions Overview
Converting Data to the Parquet Data Format
This solution describes how to convert Avro files to Parquet, a columnar data format.
Automating Impala Metadata Updates for Drift Synchronization for Hive
This solution describes how to configure a Drift Synchronization Solution for Hive pipeline to automatically refresh the Impala metadata cache each time changes occur in the Hive metastore.
Managing Output Files
This solution describes how to design a pipeline that writes output files to a destination, moves the files to a different location, and then changes the permissions for the files.
Stopping a Pipeline After Processing All Available Data
This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
Offloading Data from Relational Sources to Hadoop
This solution describes how to offload data from relational database tables to Hadoop.
Sending Email During Pipeline Processing
This solution describes how to design a pipeline that sends email notifications at different points during pipeline processing.
Preserving an Audit Trail of Events
This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events.
Loading Data into Databricks Delta Lake
You can use several solutions to load data into a Delta Lake table on Databricks.
Drift Synchronization Solution for Hive
The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables.
Drift Synchronization Solution for PostgreSQL
The Drift Synchronization Solution for PostgreSQL detects drift in incoming data and automatically creates or alters corresponding PostgreSQL tables as needed before the data is written.
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
Creating a Microservice Pipeline
Data Preview Overview
Preview Codes
Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
Input and Output Schema for Stages
After running preview for a pipeline, you can view the input and output schema for each stage on the Schema tab in the pipeline properties panel. The schema includes each field path and data type.
Previewing a Single Stage
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Tutorial Overview
Before You Begin
Basic Tutorial
The basic tutorial creates a pipeline that reads a file from an HTTP resource URL, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, then create a data alert and run the pipeline.
Extended Tutorial
The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash destination. You'll also use data preview to test stage configuration.