What is a Pipeline?

A pipeline describes the flow of data from the origin system to destination systems and defines how to transform the data along the way.

You can use a single origin stage to represent the origin system, multiple processor stages to transform data, and multiple destination stages to represent destination systems.
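The one-origin, many-processors, many-destinations shape of a pipeline can be summarized with a minimal sketch. The class and field names below are hypothetical and only illustrate the topology; they are not part of the Data Collector or Control Hub APIs.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Stage:
    name: str           # label shown for the stage in the pipeline
    stage_type: str     # "origin", "processor", "destination", or "executor"

@dataclass
class Pipeline:
    origin: Stage                                             # exactly one origin stage
    processors: List[Stage] = field(default_factory=list)     # zero or more processors
    destinations: List[Stage] = field(default_factory=list)   # one or more destinations

# Example topology: read from one source, transform, write to two destinations.
pipeline = Pipeline(
    origin=Stage("Read source data", "origin"),
    processors=[Stage("Mask sensitive fields", "processor")],
    destinations=[
        Stage("Write to data lake", "destination"),
        Stage("Write to warehouse", "destination"),
    ],
)
```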

When you develop a pipeline, you can use development stages to provide sample data and generate errors to test error handling. You can also use data preview to see how each stage alters the data as it moves through the pipeline.
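As a rough illustration of what development stages and error handling let you exercise, the hypothetical snippet below generates sample records, injects occasional failures, and routes failed records to an error sink. None of the names correspond to actual Data Collector stages; the pipeline handles this routing for you.

```python
import random

def generate_sample_records(count: int):
    """Stand-in for a development origin that emits synthetic test data."""
    return [{"id": i, "value": random.random()} for i in range(count)]

def transform(record: dict) -> dict:
    """Stand-in for a processor stage; fails on some records to exercise error handling."""
    if record["value"] < 0.1:                      # simulate a data problem
        raise ValueError(f"bad value in record {record['id']}")
    return {**record, "scaled": record["value"] * 100}

good, errors = [], []
for rec in generate_sample_records(20):
    try:
        good.append(transform(rec))
    except ValueError as exc:
        errors.append({"record": rec, "error": str(exc)})   # error records go to an error sink

print(f"{len(good)} records processed, {len(errors)} sent to error handling")
```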

You can use executor stages to perform event-triggered task execution or to save event information. To process large volumes of data, you can use multithreaded pipelines or cluster mode pipelines.
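The benefit of multithreading for large data volumes can be sketched generically: several workers each process their own batches in parallel, raising overall throughput. This is a conceptual Python sketch under that assumption, not how Data Collector implements multithreaded or cluster mode pipelines internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical batches; in a real pipeline the origin produces these continuously.
batches = [[{"id": i * 100 + j} for j in range(100)] for i in range(8)]

def process_batch(batch):
    """Stand-in for running a batch through the pipeline's processor stages."""
    return [{**record, "processed": True} for record in batch]

# A multithreaded pipeline runs several pipeline instances in parallel,
# each handling its own batches.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_batch, batches))

print(sum(len(b) for b in results), "records processed across", len(batches), "batches")
```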

In pipelines that write to Hive, Parquet, or PostgreSQL, you can implement a data drift solution that detects drift in the incoming data and updates the corresponding tables in the destination systems.
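Conceptually, a data drift solution is a compare-and-alter step: detect fields in the incoming data that the destination table lacks, then extend the table before writing. The sketch below is a simplified, hypothetical illustration using PostgreSQL DDL; Data Collector's drift-enabled destinations perform the equivalent work automatically.

```python
# Hypothetical, simplified illustration of drift handling for a PostgreSQL table.
known_columns = {"id", "name"}                 # columns currently in the destination table

def handle_drift(record: dict, table: str) -> list:
    """Return the DDL needed so the table can accept every field in the record."""
    ddl = []
    for new_field in record.keys() - known_columns:
        ddl.append(f'ALTER TABLE {table} ADD COLUMN "{new_field}" TEXT')
        known_columns.add(new_field)
    return ddl

incoming = {"id": 1, "name": "alice", "signup_date": "2024-01-01"}  # drifted record
for statement in handle_drift(incoming, "customers"):
    print(statement)   # in practice this would be executed against the database
```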

When you are done with pipeline development, you can publish the pipeline and create a job to execute the dataflow defined in the pipeline. When you start a job, Control Hub runs the job on the available Data Collectors associated with the job. For more information about jobs, see Jobs Overview.
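The publish-and-run workflow can be made concrete with a hypothetical client object. The class and method names below are illustrative pseudocode only; they are not the Control Hub SDK or REST API.

```python
# Hypothetical client; method names are illustrative and do not correspond
# to the Control Hub SDK or REST API.
class ControlHubClient:
    def publish_pipeline(self, pipeline_id: str) -> str:
        print(f"published {pipeline_id}")
        return f"{pipeline_id}:v1"

    def create_job(self, pipeline_version: str, data_collector_labels: list) -> str:
        print(f"created job for {pipeline_version} on labels {data_collector_labels}")
        return "job-001"

    def start_job(self, job_id: str) -> None:
        # Control Hub starts the job on the available Data Collectors associated with it.
        print(f"started {job_id}")

hub = ControlHubClient()
version = hub.publish_pipeline("sales-ingest")
job = hub.create_job(version, ["production"])
hub.start_job(job)
```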