What is the StreamSets Platform?

The StreamSets platform is a cloud-native platform for building, running, and monitoring data pipelines.

A pipeline describes the flow of data from origin to destination systems and defines how to process the data along the way. Pipelines can access multiple types of external systems, including cloud data lakes, cloud data warehouses, and on-premises storage systems such as relational databases.

As a pipeline runs, you can view real-time statistics and error information about the data as it flows from origin to destination systems.
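
To make the pipeline concept concrete, the following minimal Python sketch models an origin, a processing step, and a destination, and keeps running counts of records and errors. It is a conceptual illustration only: it does not use any StreamSets API, and `PipelineStats`, `run_pipeline`, and `to_celsius` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class PipelineStats:
    """Running counters, loosely analogous to pipeline statistics (hypothetical)."""
    records_read: int = 0
    records_written: int = 0
    errors: List[Tuple[dict, str]] = field(default_factory=list)


def run_pipeline(
    origin: Iterable[dict],
    processor: Callable[[dict], dict],
    destination: List[dict],
    stats: PipelineStats,
) -> PipelineStats:
    """Read each record from the origin, process it, and write it to the destination."""
    for record in origin:
        stats.records_read += 1
        try:
            result = processor(record)
        except Exception as exc:
            # Keep the failing record and the reason instead of stopping the run.
            stats.errors.append((record, str(exc)))
            continue
        destination.append(result)
        stats.records_written += 1
    return stats


def to_celsius(record: dict) -> dict:
    """Example processing step: add a Celsius reading to each record."""
    return {**record, "temp_c": round((record["temp_f"] - 32) * 5 / 9, 1)}


origin_records = [
    {"sensor": "a", "temp_f": 212.0},
    {"sensor": "b"},  # missing field, so this record ends up in the error list
    {"sensor": "c", "temp_f": 32.0},
]
destination_records: List[dict] = []
stats = run_pipeline(origin_records, to_celsius, destination_records, PipelineStats())
print(stats.records_read, stats.records_written, len(stats.errors))
```

In this sketch, a record that fails processing is counted as an error rather than halting the run, which mirrors the idea of viewing error information alongside statistics while data flows.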

The StreamSets platform uses the following components to manage your pipelines:
Control plane
The StreamSets control plane consists of StreamSets Control Hub, a public cloud service hosted by StreamSets that you access using a web browser. Use Control Hub to build, manage, and monitor your pipelines.
Data plane
The StreamSets data plane provides the following engines to process data:
  • Data Collector - Use to run data ingestion pipelines that can read from and write to a large number of heterogeneous origins and destinations. Data Collector pipelines perform record-based data transformations in streaming, change data capture (CDC), or batch modes.
  • Transformer - Use to run data processing pipelines on Apache Spark. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform set-based transformations such as joins, aggregates, and sorts on the entire data set.
  • Transformer for Snowflake - Use to run pipelines that process Snowflake data using Snowpark client libraries. Transformer for Snowflake enables you to design and perform complex processing in Snowflake without having to write SQL queries or templates.
You deploy Data Collector and Transformer engines in your corporate network, either on-premises or on a protected cloud computing platform.
Most organizations use the Transformer for Snowflake engine that StreamSets hosts and manages. Depending on your organization's account agreement, you can also deploy Transformer for Snowflake engines yourself, as you do the other engine types.
When you start a pipeline from Control Hub, the engine uses the pipeline configuration to process the data. The engine also sends status updates and metrics about the running pipeline back to Control Hub so that you can monitor pipeline progress in real time.
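
The engine descriptions above distinguish record-based transformations (Data Collector) from set-based transformations (Transformer). The plain Python sketch below illustrates only that conceptual difference: a record-based step looks at one record at a time, while a set-based step such as an aggregate or sort needs the entire data set. It does not use StreamSets or Spark APIs, and the sample data and function names (`add_tax`, `total_by_customer`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample data for illustration.
orders: List[dict] = [
    {"customer": "acme", "amount": 120.0},
    {"customer": "globex", "amount": 75.5},
    {"customer": "acme", "amount": 30.0},
]


def add_tax(record: dict, rate: float = 0.1) -> dict:
    """Record-based: the output depends only on the single input record."""
    return {**record, "amount_with_tax": round(record["amount"] * (1 + rate), 2)}


def total_by_customer(records: List[dict]) -> List[Tuple[str, float]]:
    """Set-based: the aggregate and the sort both need every record in the data set."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["amount"]
    # Sort customers by total amount, largest first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


record_based = [add_tax(order) for order in orders]  # can be applied as records arrive
set_based = total_by_customer(orders)                # requires the full data set
print(record_based)
print(set_based)
```

The record-based step can be applied to each record as it arrives, which suits streaming ingestion. The set-based step cannot produce a final answer until all records are available, which is why that kind of processing is associated with a cluster engine operating on the whole data set.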

The following image provides a general overview of the StreamSets platform components when using deployed engines: