Data Collectors Overview
Data Collector is an execution engine that works directly with Control Hub. You install Data Collectors in your corporate network, which can be on-premises or on a protected cloud computing platform, and then register them to work with Control Hub.
Each registered Data Collector serves as either an authoring or an execution Data Collector. Use an authoring Data Collector to design pipelines and to create connections. Use an execution Data Collector to execute standalone and cluster pipelines run from Control Hub jobs.
A single Data Collector can serve both purposes. However, StreamSets recommends dedicating each Data Collector as either an authoring or execution Data Collector.
Control Hub monitors the resources that each Data Collector uses. Control Hub only starts jobs on a Data Collector that has not reached any resource thresholds.
Install and register Data Collectors in one of the following ways:
- Manually administer
Install individual Data Collectors and then register them to work with Control Hub. Manually administer and upgrade each Data Collector individually.
You might want to install and manually administer a small number of authoring Data Collectors used to design pipelines. You can manually administer a single authoring Data Collector that all data engineers can use. Or you can manually administer multiple authoring Data Collectors, granting only a few data engineers access to each instance.
- Automatically provision
Automatically provision Data Collectors on a Kubernetes container orchestration framework. Create a Control Hub Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. A Provisioning Agent is a containerized application that runs in your Kubernetes cluster. It uses deployments that you configure to automatically provision Data Collector Docker containers in the cluster - including deploying, registering, starting, scaling, and stopping the Data Collector containers.
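As a rough sketch of what the Provisioning Agent manages, the provisioned Data Collector containers behave like replicas of a Kubernetes Deployment. The manifest below is illustrative only: the names, labels, and replica count are assumptions, and the actual deployment that the agent creates is generated from the configuration you define in Control Hub and also carries the registration details that the agent injects.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sdc-execution                # illustrative name, not generated by the agent
  labels:
    app: streamsets-datacollector
spec:
  replicas: 3                        # scaling changes this count up or down
  selector:
    matchLabels:
      app: streamsets-datacollector
  template:
    metadata:
      labels:
        app: streamsets-datacollector
    spec:
      containers:
        - name: datacollector
          image: streamsets/datacollector:latest   # public Data Collector image
          ports:
            - containerPort: 18630                 # default Data Collector port
```

Scaling the number of provisioned Data Collectors then amounts to changing the replica count, which the Provisioning Agent does for you based on the deployment configuration.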
All registered Data Collectors - either manually administered or automatically provisioned - function in the same way.
Example
You might manually administer a single authoring Data Collector used to design pipelines, and then automatically provision a larger number of execution Data Collectors.
- Manually administer a single authoring Data Collector
- Your organization has four data engineers who use the Control Hub Pipeline Designer to design pipelines. You install Data Collector on an on-premises machine. You register this Data Collector with Control Hub and assign the Authoring label to this Data Collector.
- Automatically provision multiple execution Data Collectors
- Your organization has multiple data centers located in different geographic regions. Each data center has its own Kubernetes cluster. You want to automatically provision Data Collectors in the Kubernetes clusters to reduce the overhead of managing and upgrading a large number of Data Collector installations. In addition, you need to regularly scale the number of Data Collectors running jobs during peak processing times.
Data Collector Versions
StreamSets recommends using the latest version of Data Collector with Control Hub to ensure that you can use the newest features.
You can register earlier Data Collector versions. You can even register Data Collectors of different versions. However, since Data Collector functionality can differ from version to version, use an authoring Data Collector that is the same version as the execution Data Collectors that you intend to use to run the pipeline. Using a different Data Collector version can result in a pipeline that is invalid for execution Data Collectors.
For example, if the authoring Data Collector is a more recent version than the execution Data Collector, pipelines might include a stage, stage library, or stage functionality that does not exist in the execution Data Collector.
In addition, ensure that all execution Data Collectors assigned the same label are the same Data Collector version.
When you start a job on a group of Data Collectors with the same label, any of the Data Collectors can run a pipeline instance for the job. As a result, all Data Collectors that function as a group must use the same Data Collector version and must have an identical configuration to ensure consistent processing.
For the minimum Data Collector versions that Control Hub supports, see Control Hub Requirements.