Data Collectors Overview

Data Collector is an execution engine that works directly with Control Hub. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.

Each registered Data Collector serves as either an authoring or an execution Data Collector. Use an authoring Data Collector to design pipelines and to create connections. Use an execution Data Collector to run standalone and cluster pipelines started from Control Hub jobs.

A single Data Collector can serve both purposes. However, StreamSets recommends dedicating each Data Collector as either an authoring or execution Data Collector.

Control Hub monitors the resources that each Data Collector uses, and starts jobs only on Data Collectors that have not reached any resource thresholds.

Install and register Data Collectors in one of the following ways:

Manually administer

Install individual Data Collectors and then register them to work with Control Hub. Manually administer and upgrade each Data Collector individually.

You might want to install and manually administer a small number of authoring Data Collectors used to design pipelines. You can manually administer a single authoring Data Collector that all data engineers can use. Or you can manually administer multiple authoring Data Collectors, granting only a few data engineers access to each instance.
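Registration details vary by Data Collector version, but conceptually, a manually administered Data Collector is pointed at Control Hub through its configuration. The sketch below is illustrative only; the property names and file locations are assumptions, so consult the registration documentation for your Data Collector version:

```properties
# Sketch of Control Hub registration properties in $SDC_CONF/dpm.properties.
# Property names and values are illustrative assumptions; they depend on the
# Data Collector version.
dpm.enabled=true
dpm.base.url=https://cloud.streamsets.com
dpm.appAuthToken=@application-token.txt@
```

After registration, the Data Collector appears in Control Hub, where you can assign labels such as Authoring to it.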
Automatically provision

Automatically provision Data Collectors on a Kubernetes container orchestration framework. Create a Control Hub Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. A Provisioning Agent is a containerized application that runs in your Kubernetes cluster. Based on deployments that you configure, the agent automatically provisions Data Collector Docker containers in the cluster: it deploys, registers, starts, scales, and stops the containers.

Use provisioning to reduce the overhead of managing a large number of Data Collector installations. Instead of maintaining many individual installations, you manage a central Kubernetes cluster that runs multiple Data Collector containers.

Provisioning is especially useful when you require a large number of execution Data Collectors to run jobs. When you provision Data Collectors, you benefit from all of the features that Docker and Kubernetes offer, such as easily scaling Data Collector containers and updating them to a new image with a different Data Collector version or configuration.
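A deployment that a Provisioning Agent acts on is defined as a Kubernetes specification. As a rough sketch only, assuming the public streamsets/datacollector image and placeholder names (a real Control Hub deployment includes additional configuration that the Provisioning Agent manages), such a specification might look like:

```yaml
# Illustrative sketch only, not a complete Control Hub deployment specification.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sdc-execution              # placeholder name
spec:
  replicas: 3                      # number of Data Collector containers to provision
  selector:
    matchLabels:
      app: sdc-execution
  template:
    metadata:
      labels:
        app: sdc-execution
    spec:
      containers:
        - name: datacollector
          image: streamsets/datacollector:latest   # pin an exact version in practice
          ports:
            - containerPort: 18630                 # default Data Collector port
```

Scaling the number of provisioned Data Collectors then amounts to changing the replica count; the Provisioning Agent handles registering and starting the additional containers.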

For more information about provisioning, see Provisioned Data Collectors Overview.

All registered Data Collectors, whether manually administered or automatically provisioned, function in the same way.


In practice, you might manually administer a single authoring Data Collector used to design pipelines and then automatically provision a larger number of execution Data Collectors.

Let's look at an example of how you might want to use both methods of registering Data Collectors:
Manually administer a single authoring Data Collector
Your organization has four data engineers who use the Control Hub Pipeline Designer to design pipelines. You install Data Collector on an on-premises machine. You register this Data Collector with Control Hub and assign the Authoring label to this Data Collector.
The data engineers select this authoring Data Collector in Pipeline Designer to design, preview, and test their pipelines. When the data engineers finish designing pipelines, they publish the pipelines for job execution.
Automatically provision multiple execution Data Collectors
Your organization has multiple data centers located in different geographic regions. Each data center has its own Kubernetes cluster. You want to automatically provision Data Collectors in the Kubernetes clusters to reduce the overhead of managing and upgrading a large number of Data Collector installations. In addition, you need to regularly scale the number of Data Collectors running jobs during peak processing times.
You create one Provisioning Agent for the Kubernetes cluster in the western region data center and another for the Kubernetes cluster in the eastern region data center. You create one deployment for each Provisioning Agent, assigning the label WestDataCenter to one deployment and the label EastDataCenter to the other.
When the DevOps engineers create jobs, they select the appropriate data center label to ensure that the jobs are started on the group of Data Collector containers deployed to the Kubernetes cluster for that data center.
During periods of peak demand, DevOps engineers simply modify a provisioning deployment to scale up the number of Data Collectors. The Provisioning Agent automatically deploys, registers, and starts additional Data Collector containers that run additional remote pipeline instances for jobs running on that deployment.

Data Collector Versions

StreamSets recommends using the latest version of Data Collector with Control Hub to ensure that you can use the newest features.

You can register earlier Data Collector versions, and you can even register Data Collectors of different versions. However, because Data Collector functionality can differ from version to version, use an authoring Data Collector that is the same version as the execution Data Collectors that you intend to use to run the pipeline. Using a different Data Collector version can result in a pipeline that is invalid on the execution Data Collectors.

For example, if the authoring Data Collector is a more recent version than the execution Data Collector, pipelines might include a stage, stage library, or stage functionality that does not exist in the execution Data Collector.

In addition, ensure that all execution Data Collectors assigned the same label are the same Data Collector version.

When you start a job on a group of Data Collectors with the same label, any of the Data Collectors can run a pipeline instance for the job. As a result, all Data Collectors that function as a group must use the same Data Collector version and must have an identical configuration to ensure consistent processing.
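Because any Data Collector in a labeled group can run a pipeline instance, version drift within a group is worth catching before a job lands on a mismatched instance. As a minimal sketch, a script could compare the version strings reported by each instance; the version values below are examples, and the REST endpoint mentioned in the comment is an assumption to verify against the API documentation for your version:

```shell
#!/bin/sh
# Check that a set of Data Collector version strings are identical, so that
# Data Collectors sharing a label do not drift apart.
# In practice you might collect each version over the Data Collector REST API,
# e.g. something like: curl -s http://<sdc-host>:18630/rest/v1/system/info
# (the endpoint path is an assumption; check the API docs for your version).

versions_consistent() {
  # Prints "consistent" when every argument matches the first, else "mismatch".
  first="$1"
  for v in "$@"; do
    if [ "$v" != "$first" ]; then
      echo "mismatch"
      return 1
    fi
  done
  echo "consistent"
}

versions_consistent "5.8.0" "5.8.0" "5.8.0"    # prints "consistent"
versions_consistent "5.8.0" "5.7.2" || true    # prints "mismatch"
```

Running such a check after scaling a deployment or upgrading an installation confirms that every Data Collector behind a label still reports the same version.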

For the minimum Data Collector versions that Control Hub supports, see Control Hub Requirements.