Provisioned Data Collectors Overview

You can automatically provision Data Collector containers on a container orchestration framework in your environment, such as Kubernetes.

Provisioning Data Collectors involves the following components:
Data Collector Docker image
Customize the public StreamSets Data Collector Docker image for your configuration requirements. For example, you might need to modify the Data Collector configuration files, install external libraries, or store custom stage libraries. Use Docker to customize the public Data Collector Docker image and then store the private image in your private repository.
Provisioning Agent
A Provisioning Agent is a containerized application that runs in a Kubernetes container orchestration framework. The agent communicates with Control Hub to automatically provision Data Collector containers in the Kubernetes cluster in which it runs. Provisioning includes deploying, registering, starting, scaling, and stopping the Data Collector containers. You can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication.
You can use Helm charts or Kubernetes commands to create a Provisioning Agent and deploy it as a containerized application to a Kubernetes pod. StreamSets recommends Helm because it streamlines installing and managing Kubernetes applications.

Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you create two Provisioning Agents, one for each cluster.

Deployment
A deployment is a logical grouping of Data Collector containers deployed by a Provisioning Agent to Kubernetes. All Data Collector containers in a deployment are identical and highly available.
Define a deployment YAML specification file, and then use Control Hub to create a deployment with that specification. The deployment specification file can optionally associate a Kubernetes Horizontal Pod Autoscaler, service, or Ingress with the deployment.
When you create the deployment, you specify the Provisioning Agent that manages the deployment, the number of container instances that you want to run, and the labels to assign to the deployed Data Collector containers. You can change that information later by updating your deployment.
When you start a deployment, the Provisioning Agent deploys the Data Collector containers, creating a Kubernetes pod to host each Data Collector container. The agent also registers each Data Collector container with Control Hub.

You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.
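
The deployment specification file described above is, at its core, a Kubernetes Deployment manifest. The following is a minimal sketch in Python of what such a manifest contains, not the specification that Control Hub generates. The function name, the deployment name, the registry.example.com image path, and the app label are all hypothetical; 18630 is assumed to be the Data Collector HTTP port. The output is JSON only to avoid a third-party YAML dependency, so convert it to YAML before using it as a specification file.

```python
import json


def build_deployment_spec(name, image, replicas=1, port=18630):
    """Return a Kubernetes apps/v1 Deployment manifest as a plain dict.

    All argument values are illustrative; nothing here is taken from
    an actual Control Hub specification.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    labels = {"app": name}  # hypothetical label used to select the pods
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


spec = build_deployment_spec(
    "datacollector-deployment",
    "registry.example.com/sdc-custom:latest",  # hypothetical private image
    replicas=2,
)
print(json.dumps(spec, indent=2))
```

Note that the number of container instances is also set when you create the deployment in Control Hub, so the replica count in a hand-written specification may not be the value that takes effect.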

Provisioning is especially useful when you require a large number of execution Data Collectors to run jobs. When you provision Data Collectors, you benefit from the features that Docker and Kubernetes offer, including easily scaling Data Collector containers and updating them to a new image with a different Data Collector version or different configurations.
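
At the Kubernetes level, scaling and image updates both come down to small changes in a Deployment manifest: the replica count and the container image. The sketch below illustrates this on a hypothetical manifest. It is not how the Provisioning Agent is implemented, and the function names, deployment name, and image tags are assumptions made for the example.

```python
import copy


def scale(spec, replicas):
    """Return a copy of the manifest with a new replica count."""
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    updated = copy.deepcopy(spec)  # leave the caller's manifest untouched
    updated["spec"]["replicas"] = replicas
    return updated


def with_image(spec, image, container_index=0):
    """Return a copy of the manifest that points one container at a new image."""
    updated = copy.deepcopy(spec)
    containers = updated["spec"]["template"]["spec"]["containers"]
    containers[container_index]["image"] = image
    return updated


# Hypothetical starting manifest; names and tags are illustrative only.
current = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "datacollector-deployment"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "datacollector-deployment"}},
        "template": {
            "metadata": {"labels": {"app": "datacollector-deployment"}},
            "spec": {
                "containers": [
                    {
                        "name": "datacollector",
                        "image": "registry.example.com/sdc-custom:1.0",
                    }
                ]
            },
        },
    },
}

scaled = scale(current, 5)
upgraded = with_image(scaled, "registry.example.com/sdc-custom:2.0")

print(upgraded["spec"]["replicas"])
print(upgraded["spec"]["template"]["spec"]["containers"][0]["image"])
```

Both helpers return modified copies, so the original manifest remains available for comparison or rollback.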

You can also automatically provision an authoring Data Collector dedicated to pipeline design as long as the authoring Data Collector is provisioned from a unique deployment that doesn't include any execution Data Collectors.
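
The relationships described above (one Provisioning Agent per cluster, several deployments per agent, and authoring Data Collectors kept in their own deployment) can be sketched as a small data model. This is purely illustrative and does not use any Control Hub API; the class names, the purpose field, and the deployment names are hypothetical.

```python
from dataclasses import dataclass, field

VALID_PURPOSES = ("execution", "authoring")


@dataclass
class Deployment:
    """A group of identical Data Collector containers (hypothetical model)."""
    name: str
    instances: int
    purpose: str = "execution"  # one purpose for the whole deployment

    def __post_init__(self):
        if self.purpose not in VALID_PURPOSES:
            raise ValueError(f"unknown purpose: {self.purpose!r}")
        if self.instances < 1:
            raise ValueError("a deployment needs at least one instance")


@dataclass
class ProvisioningAgent:
    """One agent per Kubernetes cluster, managing any number of deployments."""
    cluster: str
    deployments: list = field(default_factory=list)


def plan_agents(clusters):
    """Create exactly one Provisioning Agent for each cluster name."""
    return {cluster: ProvisioningAgent(cluster) for cluster in clusters}


agents = plan_agents(["production", "disaster-recovery"])
prod = agents["production"]
prod.deployments.append(Deployment("web-server-logs", instances=3))
prod.deployments.append(Deployment("google-cloud", instances=2))
prod.deployments.append(
    Deployment("pipeline-design", instances=1, purpose="authoring")
)

for deployment in prod.deployments:
    print(deployment.name, deployment.purpose, deployment.instances)
```

Recording the purpose on the deployment rather than on each container means a single deployment cannot mix authoring and execution Data Collectors, which mirrors the rule stated above.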