IBM StreamSets - Data Collector Engine Guide
Index
Release Notes
5.12.x Release Notes
5.11.x Release Notes
5.10.x Release Notes
5.9.x Release Notes
5.8.x Release Notes
5.7.x Release Notes
5.6.x Release Notes
5.5.x Release Notes
5.4.x Release Notes
5.3.x Release Notes
5.2.x Release Notes
5.1.x Release Notes
5.0.x Release Notes
4.4.x Release Notes
4.3.x Release Notes
4.2.x Release Notes
4.1.x Release Notes
4.0.x Release Notes
Installation
What is IBM StreamSets?
Installation Overview
Requirements for Self-Managed Deployments
Stage Libraries
Supported Systems and Versions
MapR Prerequisites
Configuration
Enabling HTTPS
Data Collector Configuration
Using a Proxy Server
Java and Security Configuration
Install External Libraries
Custom Stage Libraries
Credential Stores
Working with Data Governance Tools
Enabling External JMX Tools
Upgrade
Upgrade
Post Upgrade Tasks
Pipeline Concepts and Design
What is a Pipeline?
Data in Motion
Designing the Data Flow
Dropping Unwanted Records
Error Record Handling
Record Header Attributes
Field Attributes
Processing Changed Data
Control Character Removal
Development Stages
Shortcut Keys for Pipeline Design
Test Origin for Preview
Resetting the Origin
Understanding Pipeline States
Technology Preview Functionality
Deprecated Functionality
Pipeline Configuration
Retrying the Pipeline
Rate Limit
Advanced Options
Simple and Bulk Edit Mode
Runtime Values
Event Generation
Webhooks
Notifications
SSL/TLS Encryption
Security in Amazon Stages
Security in Google Cloud Stages
Security in Kafka Stages
Kafka Message Keys
SSL/TLS in CONNX Stages
Authentication in Salesforce Stages
Expression Configuration
Configuring a Pipeline
Data Formats
Data Formats Overview
Avro Data Format
Binary Data Format
Datagram Data Format
Delimited Data Format
Excel Data Format
Log Data Format
NetFlow Data Processing
Parquet Data Format
Protobuf Data Format Prerequisites
SDC Record Data Format
Text Data Format with Custom Delimiters
Whole File Data Format
Reading and Processing XML Data
Writing XML Data
Origins
Origins
Amazon S3
Amazon SQS Consumer
Aurora PostgreSQL CDC Client
Azure Blob Storage
Azure Data Lake Storage Gen2
Azure Data Lake Storage Gen2 (Legacy)
Azure IoT/Event Hub Consumer
CoAP Server
CONNX
CONNX CDC
Cron Scheduler
Couchbase
Directory
Elasticsearch
File Tail
Google BigQuery
Google Cloud Storage
Google Pub/Sub Subscriber
Groovy Scripting
Hadoop FS Standalone
HTTP Client
HTTP Server
JavaScript Scripting
JDBC Multitable Consumer
JDBC Query Consumer
Jira
JMS Consumer
Jython Scripting
Kafka Multitopic Consumer
Kinesis Consumer
MapR DB CDC
MapR DB JSON
MapR FS Standalone
MapR Multitopic Streams Consumer
MapR Streams Consumer
MongoDB
MongoDB Atlas
MongoDB Atlas CDC
MongoDB Oplog
MQTT Subscriber
MySQL Binary Log
OPC UA Client
Oracle Bulkload
Oracle Multitable Consumer
Oracle CDC
Oracle CDC Client
PostgreSQL CDC Client
Pulsar Consumer
Pulsar Consumer (Legacy)
RabbitMQ Consumer
Redis Consumer
REST Service
Salesforce
Salesforce Bulk API 2.0
SAP HANA Query Consumer
SFTP/FTP/FTPS Client
Snowflake Bulk
SQL Server CDC Client
SQL Server Change Tracking
Start Jobs
TCP Server
UDP Multithreaded Source
UDP Source
Web Client
WebSocket Client
WebSocket Server
Processors
Processors
Base64 Field Decoder
Base64 Field Encoder
Control Hub API
Couchbase Lookup
Data Generator
Data Parser
Delay
Encrypt and Decrypt Fields
Expression Evaluator
Field Flattener
Field Hasher
Field Mapper
Field Masker
Field Merger
Field Order
Field Pivoter
Field Remover
Field Renamer
Field Replacer
Field Splitter
Field Type Converter
Field Zip
Geo IP
Groovy Evaluator
HBase Lookup
Hive Metadata
HTTP Client
HTTP Router
JavaScript Evaluator
JDBC Lookup
JDBC Tee
JSON Generator
JSON Parser
Jython Evaluator
Kaitai Struct Parser
Kudu Lookup
Log Parser
MLeap Evaluator
MongoDB Atlas Lookup
The MongoDB Atlas Lookup processor performs lookups in MongoDB Atlas or MongoDB Enterprise Server and passes all values from the returned document to a new list-map field in the record.
MongoDB Lookup
PMML Evaluator
PostgreSQL Metadata
Record Deduplicator
Redis Lookup
Salesforce Bulk API 2.0 Lookup
Salesforce Lookup
Schema Generator
SQL Parser
Start Jobs
Static Lookup
Stream Selector
TensorFlow Evaluator
Wait for Jobs
Web Client
Whole File Transformer
Windowing Aggregator
XML Flattener
XML Parser
Destinations
Destinations
Aerospike Client
Amazon S3
Azure Blob Storage
Azure Data Lake Storage Gen2
Azure Event Hub Producer
Azure IoT Hub Producer
Azure Synapse SQL
Cassandra
CoAP Client
Couchbase
Databricks Delta Lake
Elasticsearch
Google BigQuery
Google Bigtable
Google Cloud Storage
Google Pub/Sub Publisher
Hadoop FS
HBase
Hive Metastore
HTTP Client
InfluxDB
InfluxDB 2.x
JDBC Producer
Jira
JMS Producer
Kafka Producer
Kinesis Firehose
Kinesis Producer
Kudu
Local FS
MapR DB
MapR DB JSON
MapR FS
MapR Streams Producer
MongoDB
MongoDB Atlas
MQTT Publisher
Named Pipe
Oracle
Pulsar Producer
RabbitMQ Producer
Redis
Salesforce
Salesforce Bulk API 2.0
Send Response to Origin
SFTP/FTP/FTPS Client
SingleStore
Snowflake
Snowflake File Uploader
Solr
Splunk
Syslog
Tableau CRM
Teradata
To Error
Trash
Web Client
WebSocket Client
Executors
Executors
ADLS Gen2 File Metadata
Amazon S3
Databricks Job Launcher
Databricks Query
Email
Google Cloud Storage
Google BigQuery
HDFS File Metadata
Hive Query
JDBC Query
MapR FS File Metadata
MapReduce
Pipeline Finisher
SFTP/FTP/FTPS Client
Shell
Snowflake
Spark
Dataflow Triggers
Dataflow Triggers Overview
Pipeline Event Generation
Stage Event Generation
Executors
Logical Pairings
Event Records
Viewing Events in Data Preview and Snapshot
Executing Pipeline Events in Data Preview
Summary
Solutions
Solutions Overview
Converting Data to the Parquet Data Format
Automating Impala Metadata Updates for Drift Synchronization for Hive
Managing Output Files
Stopping a Pipeline After Processing All Available Data
Offloading Data from Relational Sources to Hadoop
Sending Email During Pipeline Processing
Preserving an Audit Trail of Events
Loading Data into Databricks Delta Lake
Drift Synchronization Solution for Hive
Drift Synchronization Solution for PostgreSQL
Multithreaded Pipelines
Multithreaded Pipeline Overview
How It Works
Monitoring
Tuning Threads and Runners
Resource Usage
Multithreaded Pipeline Summary
Microservice Pipelines
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
Creating a Microservice Pipeline
Orchestration Pipelines
Orchestration Pipeline Overview
Orchestration Stages
Orchestration Record
Sample Pipeline
Rules and Alerts
Rules and Alerts Overview
Metric Rules and Alerts
Data Rules and Alerts
Data Drift Rules and Alerts
Alert Webhooks
Configuring Email for Alerts
Tutorial
Tutorial Overview
Before You Begin
Basic Tutorial
Extended Tutorial
Troubleshooting
Accessing Error Messages
Pipeline Basics
Origins
Processors
Destinations
Executors
JDBC Connections
Performance
Error Codes
Error Codes
Glossary
Glossary of Terms
Data Formats by Stage
Data Format Support
Origins
Processors
Destinations
Expression Language
Expression Language
Functions
Constants
Datetime Variables
Literals
Operators
Reserved Words
Regular Expressions
Regular Expressions Overview
Regular Expressions in the Pipeline
Quick Reference
Regex Examples
Grok Patterns
Defining Grok Patterns
General Grok Patterns
Date and Time Grok Patterns
Java Grok Patterns
Log Grok Patterns
Networking Grok Patterns
Path Grok Patterns
© 2023 StreamSets, Inc.