What is IBM StreamSets?

IBM StreamSets is a cloud-native platform for building, running, and monitoring data pipelines.

A pipeline describes the flow of data from origin to destination systems and defines how to process the data along the way. Pipelines can access many types of external systems, including cloud data lakes, cloud data warehouses, and on-premises storage systems such as relational databases.

As a pipeline runs, you can view real-time statistics and error information about the data as it flows from origin to destination systems.
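
To make the pipeline idea concrete, the following is a minimal conceptual sketch in plain Python. It is not the IBM StreamSets API: the names `run_pipeline`, `PipelineStats`, and `to_celsius` are hypothetical and exist only for illustration. It shows records flowing from an origin through a processing step to a destination while input, output, and error counts are collected.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineStats:
    """Hypothetical counters, loosely modeled on the statistics a running pipeline exposes."""
    input_records: int = 0
    output_records: int = 0
    error_records: int = 0
    errors: list = field(default_factory=list)


def run_pipeline(origin, processors, destination, stats):
    """Read each record from the origin, apply every processor, and write to the destination."""
    for record in origin:
        stats.input_records += 1
        try:
            for process in processors:
                record = process(record)
        except Exception as exc:
            # A failing record is counted and skipped; it does not stop the pipeline.
            stats.error_records += 1
            stats.errors.append(f"{type(exc).__name__}: {exc}")
            continue
        destination.append(record)
        stats.output_records += 1
    return stats


def to_celsius(record):
    """Example processor: add a Celsius field derived from the Fahrenheit field."""
    return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}


# Assumed sample data: the second record lacks "temp_f" and becomes an error record.
source_rows = [{"id": 1, "temp_f": 212.0}, {"id": 2}, {"id": 3, "temp_f": 32.0}]
sink = []
stats = run_pipeline(iter(source_rows), [to_celsius], sink, PipelineStats())
print(stats.input_records, stats.output_records, stats.error_records)
```

In this sketch the origin is an in-memory list and the destination is a Python list. In a real pipeline, both would be external systems such as a database or a cloud data warehouse.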

IBM StreamSets uses the following components to manage your pipelines:
Control Hub
Control Hub is a public cloud service that you access using a web browser. Use Control Hub to build, manage, and monitor your pipelines.
Data Collector
Data Collector is an engine that processes data. Use the engine to run data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, change data capture (CDC), or batch modes.
You deploy Data Collector engines in your corporate network, which can be on-premises or on a protected cloud computing platform.
When you start a pipeline from Control Hub, the engine uses the pipeline configuration to process the data. The engine sends status updates and metrics about the running pipeline back to Control Hub so that you can monitor the pipeline's progress in real time.
Note: At times, this documentation uses "StreamSets" to refer to "IBM StreamSets".
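
The division of work between the two components can be sketched as follows. This is an illustrative model only, not StreamSets code: the `ControlPlane` and `Engine` classes, the `batch_size` setting, and the `RUNNING` and `FINISHED` state names are assumptions made for the example. The control plane holds the pipeline configuration. The engine reads that configuration, processes the data, and reports status and record counts back.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Hypothetical stand-in for Control Hub: stores pipeline configs and received status."""
    pipelines: dict = field(default_factory=dict)
    status_log: list = field(default_factory=list)

    def receive_status(self, pipeline_id, state, record_count):
        # Only state and counts are recorded here, not the records themselves.
        self.status_log.append((pipeline_id, state, record_count))


@dataclass
class Engine:
    """Hypothetical stand-in for a Data Collector engine."""
    control_plane: ControlPlane

    def run(self, pipeline_id, records):
        # The engine uses the pipeline configuration held by the control plane.
        config = self.control_plane.pipelines[pipeline_id]
        batch_size = config["batch_size"]
        processed = 0
        self.control_plane.receive_status(pipeline_id, "RUNNING", processed)
        for start in range(0, len(records), batch_size):
            batch = records[start:start + batch_size]
            processed += len(batch)  # Real processing of the batch would happen here.
            self.control_plane.receive_status(pipeline_id, "RUNNING", processed)
        self.control_plane.receive_status(pipeline_id, "FINISHED", processed)


hub = ControlPlane(pipelines={"demo": {"batch_size": 2}})
Engine(hub).run("demo", list(range(5)))
for entry in hub.status_log:
    print(entry)
```

Each entry in `status_log` corresponds to one status update, which is the kind of information a monitoring view could display while the pipeline runs.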

The following image provides a general overview of the IBM StreamSets components: