Provision Data Collectors

To provision Data Collectors, complete the following steps:

Step 1. Create a Custom Image

Use Docker to customize the public StreamSets Data Collector Docker image as needed, and then store the private image in your private repository.

Include the following customizations in the private image, based on your requirements:

Customized configuration files
Resource files
External libraries, such as JDBC drivers
Custom stages
Additional stage libraries - The public Data Collector Docker image includes the basic, development, and Windows stage libraries only.
Packages and files required to enable Kerberos authentication for Data Collector:
- On Linux, the krb5-workstation and krb5-client Kerberos client packages.
- The Hadoop or HDFS configuration files required by the Kerberos-enabled stage, for example:
  - core-site.xml
  - hdfs-site.xml
  - yarn-site.xml
  - mapred-site.xml

Each deployment managed by a Provisioning Agent specifies the Data Collector Docker image to deploy. So you can create a unique Data Collector Docker image for each deployment, or you can use one Docker image for all deployments.

For example, suppose that one deployment of provisioned Data Collectors reads from web server logs, so the Data Collector Docker image used by that deployment requires only the basic and statistics stage libraries. Another deployment of provisioned Data Collectors reads from Google Cloud Platform, so its image requires the Google Cloud stage library in addition to the basic and statistics stage libraries. You can create and manage two separate Data Collector Docker images for the deployments, or a single image that meets the needs of both.

For more information about running Data Collector from Docker, see https://hub.docker.com/r/streamsets/datacollector/.

For more information about creating private Docker images and publishing them to a private repository, see the Docker documentation.

Step 2. Create a Provisioning Agent

Use one of the following methods to create a Provisioning Agent:

Using Helm: Helm is a tool that streamlines installing and managing Kubernetes applications.
Without using Helm: If you do not want to use Helm, you can define a Provisioning Agent YAML specification file, and then use Kubernetes commands to create and deploy the Provisioning Agent.

When you use either method, you can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication. However, StreamSets recommends using Helm to enable Kerberos authentication.

Creating an Agent Using Helm

To create a Provisioning Agent using Helm, install Helm and download the Control Agent Helm chart that StreamSets provides. After modifying values in the Helm chart, use the Helm install command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.

Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.

Install Helm.

For Helm download and installation instructions, see the Helm project on GitHub.
After installing Helm, download the Control Agent Helm chart from the StreamSets Helm Charts GitHub repository.
A Helm chart is a collection of files that describe a related set of Kubernetes resources. After you download the StreamSets Control Agent Helm chart, you'll have a set of files in the following directory, where <chart_directory> is the root directory of the downloaded chart:
```
<chart_directory>/control-agent
```
For more information about Helm charts, see the Helm documentation.
Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
1. In the Navigation panel, click Administration > Provisioning Agents.
2. Click the Generate Authentication Tokens icon.
3. Click Generate.
4. Copy the token from the window and paste it in a local text file so that you can access it when you modify the YAML specification in the next step.
Open the <chart_directory>/control-agent/values.yaml file included with the StreamSets Control Agent Helm chart.
Complete the following steps to modify the file:
1. Replace the variables defined for the streamsets attribute.
  For example, the streamsets:orgId attribute is defined as follows:
```
orgId: <your org id>
```
  Replace the variable with your Control Hub organization ID, as follows:
```
orgId: MyCompany
```
2. To enable Kerberos authentication for the provisioned Data Collector containers, set the krb:enabled attribute to true and then replace all of the variables for the remaining krb attributes.
  If you are not enabling Kerberos authentication, do not make any changes to the krb attributes.
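  As a rough sketch of how the two attributes named above might look after editing, assuming that `orgId` is nested under `streamsets` and `enabled` under `krb`, as the attribute names suggest. `MyCompany` is the example organization ID used above. The remaining `streamsets` and `krb` attributes in the chart are omitted here and still need your own values:

```yaml
# Sketch only: shows just the two attributes named in this procedure.
streamsets:
  orgId: MyCompany   # your Control Hub organization ID
krb:
  enabled: true      # change only when enabling Kerberos authentication
```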
To enable Kerberos authentication for the provisioned Data Collector containers, create a folder named krb under the <chart_directory>/control-agent folder. Then, copy the Kerberos configuration file, krb5.conf, to this folder.

The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.
Run the following Helm command to install the chart:
```
helm install streamsets/control-agent
```
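The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the values YAML file to register itself with Control Hub. If you use Helm 3 or later, also supply a release name before the chart reference or add the --generate-name flag, because Helm 3 does not generate a release name by default.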
To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, click Execute > Provisioning Agents in the Control Hub Navigation panel.
The Provisioning Agents view displays all registered Provisioning Agents.

Creating an Agent without Using Helm

To create a Provisioning Agent without using Helm, configure a Provisioning Agent YAML specification file, and then use the Kubernetes create command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.

Note: StreamSets recommends using Helm to configure a Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication.

Use Kubernetes to create a secret in the Kubernetes namespace where you plan to run the Provisioning Agent.
On the Kubernetes cluster, run the following command, where <secretName> is the name of the secret and <agentNamespace> is the namespace where you plan to run the Provisioning Agent:
```
kubectl create secret generic <secretName> --namespace=<agentNamespace>
```
You must create a secret for the Provisioning Agent so that if the Provisioning Agent fails over to another Kubernetes pod, the agent can continue to manage the Data Collector containers that it already deployed. Each Provisioning Agent requires a unique secret.

For more information about creating Kubernetes secrets, see the Kubernetes documentation.
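If you prefer to keep the secret under version control, the same empty generic secret can probably be declared as a manifest instead. The following is a minimal sketch, not a StreamSets-provided file; the secret name `sch-agent-secret` and namespace `streamsets` are hypothetical stand-ins for <secretName> and <agentNamespace>:

```yaml
# Declarative equivalent of "kubectl create secret generic" with no data.
apiVersion: v1
kind: Secret
metadata:
  name: sch-agent-secret   # hypothetical; use your <secretName>
  namespace: streamsets    # hypothetical; use your <agentNamespace>
type: Opaque               # the type that "generic" secrets use
```

The manifest intentionally contains no data, matching the command above, which creates the secret without any keys.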
Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
1. In the Control Hub Navigation panel, click Administration > Provisioning Agents.
2. Click the Generate Authentication Tokens icon.
3. Click Generate.
4. Copy the token from the window and paste it in a local text file so that you can access it when you modify the YAML specification in the next step.

Copy the following lines of code into a YAML specification file:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <agentName>
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
      containers:
      - name: <agentName>
        image: streamsets/control-agent:<agentVersion>
        volumeMounts: # Mount
        - name: krb5conf
          mountPath: "/opt/kerberos/krb5.conf"
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: dpm_agent_master_url
          value: <kubernetesMasterUrl>
        - name: dpm_agent_cof_type
          value: "KUBERNETES"
        - name: dpm_agent_dpm_baseurl
          value: <schBaseUrl>
        - name: dpm_agent_component_id
          value: <agentComponentId>
        - name: dpm_agent_token_string
          value: <agentTokenString>
        - name: dpm_agent_name
          value: <agentName>
        - name: dpm_agent_orgId
          value: <schOrgId>
        - name: dpm_agent_kerberos_enabled
          value: "true"
        - name: KRB5_CONFIG
          value: "/opt/kerberos/krb5.conf"
        - name: dpm_agent_kerberos_secret
          value: <kerbsecret>
        - name: dpm_agent_kdc_type
          value: <AD|MIT>
        - name: dpm_agent_secret
          value: <secretName>
```

If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the file:

```
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
...
        volumeMounts: # Mount
        - name: krb5conf
          mountPath: "/opt/kerberos/krb5.conf"
...
        - name: dpm_agent_kerberos_enabled
          value: "true"
        - name: KRB5_CONFIG
          value: "/opt/kerberos/krb5.conf"
        - name: dpm_agent_kerberos_secret
          value: <kerbsecret>
        - name: dpm_agent_kdc_type
          value: <AD|MIT>
```

If you are enabling Kerberos authentication, create a secret named krb5conf for the Kerberos configuration file, krb5.conf.

The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.
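
One way to create the krb5conf secret is with a manifest such as the sketch below. It assumes that the secret lives in the Provisioning Agent namespace and that the file is stored under the key `krb5.conf`. The namespace `streamsets`, the realm `EXAMPLE.COM`, and the host names are hypothetical; replace the file content with your own krb5.conf:

```yaml
# Sketch of a secret that holds the Kerberos configuration file.
apiVersion: v1
kind: Secret
metadata:
  name: krb5conf           # name expected by the volumes attribute in the specification
  namespace: streamsets    # hypothetical; use your <agentNamespace>
type: Opaque
stringData:                # stringData accepts plain text; Kubernetes stores it base64-encoded
  krb5.conf: |
    [libdefaults]
      default_realm = EXAMPLE.COM
    [realms]
      EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
      }
    [domain_realm]
      .example.com = EXAMPLE.COM
```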

Replace the following variables in the file with the appropriate attribute values:


Variable	Description
agentName	Name of the Provisioning Agent.
agentVersion	Version of the StreamSets Control Agent Docker image. Use `latest` or version 5.0.0 or later. For example: `image: streamsets/control-agent:latest`
agentNamespace	Namespace of the Provisioning Agent in Kubernetes. Use the same namespace to create deployments for this Provisioning Agent.
kubernetesMasterUrl	Kubernetes master URL for your Kubernetes cluster.
schBaseUrl	URL to access Control Hub. Set to the Control Hub URL provided by your system administrator. For example, `https://<hostname>:18631`.
agentComponentId	Unique ID for this Provisioning Agent within Control Hub. For example, use `agent_<organizationID>` if your organization requires a single Provisioning Agent. Or use `agentprod_<organizationID>` and `agentrecovery_<organizationID>` if your organization requires one agent for a production cluster and another agent for a disaster recovery cluster.
agentTokenString	Authentication token that you generated in Control Hub for the Provisioning Agent.
schOrgId	Control Hub organization ID.
kerbsecret	Optional. If enabling Kerberos authentication, the secret used for Kerberos authentication. The secret contains the following values: `encryption_types`; `container_dn` (if using Active Directory); `ldap_url` (if using Active Directory); `admin_principal`; and `admin_key`.
AD|MIT	Optional. If enabling Kerberos authentication, the authentication type for the Kerberos key distribution center: Active Directory (`AD`) or MIT Kerberos (`MIT`).
secretName	Name of the secret that you created for the Provisioning Agent in the Kubernetes namespace.
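
To make the substitution concrete, the env portion of the file might look roughly like the following once the variables are replaced, here without Kerberos. Every value is a hypothetical example: `MyCompany` and the `agentprod_<organizationID>` pattern come from the table above, while the URLs and the secret name `sch-agent-secret` are invented for illustration:

```yaml
# Fragment of the container definition with example values filled in.
env:
- name: HOST
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: dpm_agent_master_url
  value: "https://kubernetes.example.com:6443"   # hypothetical Kubernetes master URL
- name: dpm_agent_cof_type
  value: "KUBERNETES"
- name: dpm_agent_dpm_baseurl
  value: "https://sch.example.com:18631"         # hypothetical Control Hub URL
- name: dpm_agent_component_id
  value: "agentprod_MyCompany"
- name: dpm_agent_token_string
  value: <agentTokenString>                      # paste the generated token here
- name: dpm_agent_name
  value: "agentprod"
- name: dpm_agent_orgId
  value: "MyCompany"
- name: dpm_agent_secret
  value: "sch-agent-secret"                      # hypothetical secret name
```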

Save the YAML specification with an appropriate file name, for example: schAgent.yml.
On the Kubernetes cluster, run the following command, where <fileName> is the name of your saved YAML file:
```
kubectl create -f <fileName>.yml
```
The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the YAML file to register itself with Control Hub.
To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, click Execute > Provisioning Agents in the Navigation panel.
The Provisioning Agents view displays all registered Provisioning Agents.

Step 3. Define a Deployment YAML Specification

Define a deployment in a YAML specification file. Each file can define a single deployment. The file can optionally define a Kubernetes Horizontal Pod Autoscaler associated with the deployment.

In most cases, you define a single Data Collector container within a deployment specification. To define multiple containers, use the StreamSets Control Agent Docker image version 5.1.0 or later and define the Data Collector container as the first element in the list of containers.
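
As an illustration of the ordering requirement, a containers list with a second container might be sketched as follows. The `log-shipper` container and its image are hypothetical and serve only to show that the Data Collector container comes first:

```yaml
# Fragment of spec.template.spec in the deployment specification.
containers:
# The Data Collector container must be the first element in the list.
- name: datacollector
  image: <privateImage>
  ports:
  - containerPort: 18630
# Hypothetical additional container, shown only to illustrate ordering.
- name: log-shipper
  image: busybox:1.36
  command: ["sh", "-c", "tail -f /dev/null"]
```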

Important: The YAML specification file must use the Kubernetes API version apps/v1 to define each deployment.

The YAML specification file can contain the following components:

Deployment: Use for a deployment of one or more Data Collectors that can be manually scaled. To manually scale a deployment, you modify the deployment in the Control Hub UI to increase the number of Data Collector instances. For a sample specification file, see Deployment Sample.
Deployment associated with a Kubernetes Horizontal Pod Autoscaler: Use for a deployment of one or more Data Collectors that must automatically scale during times of peak performance. Define the deployment and Horizontal Pod Autoscaler in the same YAML specification file. The Kubernetes Horizontal Pod Autoscaler automatically scales the deployment based on CPU utilization. For more information, see the Kubernetes Horizontal Pod Autoscaler documentation. For a sample specification file, see Deployment and Horizontal Pod Autoscaler Sample.

When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), you must add additional attributes to the deployment YAML specification file.

Important: For a deployment to be manually or automatically scaled, jobs that run on Data Collector containers in that deployment must be configured to automatically scale out pipeline processing.

Deployment Sample

Define only a deployment in the YAML specification file when creating a deployment for one or more Data Collectors that can be manually scaled.

The following sample YAML specification file defines only a deployment:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        volumeMounts:
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
          readOnly: true
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      imagePullSecrets:
      - name: <imagePullSecrets>
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
```

If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample file:

```
...
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
...
        volumeMounts:
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
          readOnly: true
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
```

Replace the following variables in the sample file with the appropriate attribute values:


Variable	Description
agentNamespace	Namespace used for the Provisioning Agent that manages this deployment.
deploymentLabel	Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent.
KerberosUser	User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses `sdc` as the Kerberos user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container using the following format: `<KerberosUser>/<host>@<realm>`. The agent determines the host and realm to use, creates the Kerberos principal, and generates the keytab for that principal. For example, if you define the `KerberosUser` attribute as `marketing` and the Provisioning Agent deploys two Data Collector containers, the agent creates the following Kerberos principals: `marketing/10.60.1.25@EXAMPLE.COM` and `marketing/10.60.1.26@EXAMPLE.COM`.
privateImage	Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, modify the attribute as follows: `image: streamsets/datacollector:<version>`, where `<version>` is the Data Collector version. For example: `image: streamsets/datacollector:4.1.0`
imagePullSecrets	Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines.
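
For reference, a pull secret for a private registry is a Kubernetes secret of type `kubernetes.io/dockerconfigjson` that must exist in the same namespace as the pods that use it. The sketch below assumes a hypothetical secret name `my-registry-credentials` and namespace `streamsets`; the encoded value is left as a placeholder:

```yaml
# Sketch of a registry pull secret, followed by the attribute that references it.
apiVersion: v1
kind: Secret
metadata:
  name: my-registry-credentials   # hypothetical name
  namespace: streamsets           # hypothetical; use your <agentNamespace>
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded Docker config JSON>
---
# Fragment of spec.template.spec in the deployment specification.
imagePullSecrets:
- name: my-registry-credentials
```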

Deployment and Horizontal Pod Autoscaler Sample

Define a deployment and Horizontal Pod Autoscaler in the YAML specification file when creating a deployment for one or more Data Collectors that automatically scale during times of peak performance.

The following sample YAML specification file defines a deployment associated with a Kubernetes Horizontal Pod Autoscaler:

```
apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app: <deploymentLabel>
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name: datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: <deploymentLabel>
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 50
```

If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample file:

```
...
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
...
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
...
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
```

Replace the following variables in the sample file with the appropriate attribute values:


Variable	Description
agentNamespace	Namespace used for the Provisioning Agent that manages this deployment.
deploymentLabel	Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent.
KerberosUser	User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses `sdc` as the Kerberos user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container using the following format: `<KerberosUser>/<host>@<realm>`. The agent determines the host and realm to use, creates the Kerberos principal, and generates the keytab for that principal. For example, if you define the `KerberosUser` attribute as `marketing` and the Provisioning Agent deploys two Data Collector containers, the agent creates the following Kerberos principals: `marketing/10.60.1.25@EXAMPLE.COM` and `marketing/10.60.1.26@EXAMPLE.COM`.
privateImage	Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, modify the attribute as follows: `image: streamsets/datacollector:<version>`, where `<version>` is the Data Collector version. For example: `image: streamsets/datacollector:4.1.0`
imagePullSecrets	Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines.

When a specification file defines a deployment and Horizontal Pod Autoscaler, the Horizontal Pod Autoscaler must be associated with the deployment defined in the same file. In the sample above, the Horizontal Pod Autoscaler is associated with the defined deployment through the following attributes:

```
kind: Deployment
name: <deploymentLabel>
```

In the Horizontal Pod Autoscaler definition, you also might want to modify the minimum and maximum replica values and the target CPU utilization percentage value. For more information on these values, see the Kubernetes Horizontal Pod Autoscaler documentation.
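
As an illustration only, the three values might be tuned as in the fragment below. Kubernetes computes the desired replica count as ceil(currentReplicas × currentUtilization / targetUtilization), so with a 60% target, four Data Collectors averaging 90% CPU would scale to ceil(4 × 90 / 60) = 6. Utilization is measured relative to the CPU request of the containers, so this sketch assumes that the Data Collector container declares a `resources.requests.cpu` value. The numbers themselves are hypothetical:

```yaml
# Fragment of the HorizontalPodAutoscaler spec; values are illustrative.
minReplicas: 2                       # never scale below two Data Collectors
maxReplicas: 6                       # upper bound on provisioned Data Collectors
targetCPUUtilizationPercentage: 60   # target average CPU, as a percentage of the CPU request
```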

Attributes for AWS Fargate with EKS

When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), add the following additional attributes to the deployment YAML specification file:

Required attribute

Add the following required environment variable to avoid having to configure the maximum open file limit on the virtual machines provisioned by AWS Fargate:

```
- name: SDC_FILE_LIMIT
  value: "0"
```

Optional attribute

Add the following optional resources attribute to define the size of the virtual machines that AWS Fargate provisions. Set the values of the cpu and memory attributes as needed:

```
resources:
  limits:
    cpu: 500m
    memory: 2G
  requests:
    cpu: 200m
    memory: 2G
```

For example, to define a deployment only on AWS Fargate with EKS using the public StreamSets Data Collector Docker image, use the following sample YAML specification file. The additional attributes used by AWS Fargate are the SDC_FILE_LIMIT environment variable and the resources attribute:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
        - name: SDC_FILE_LIMIT
          value: "0"
        resources:
          limits:
            cpu: 500m
            memory: 2G
          requests:
            cpu: 200m
            memory: 2G
```

Step 4. Create a Deployment

After defining the deployment YAML specification file, use Control Hub to create a deployment.

You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.

In the Navigation panel, click Execute > Deployments.
Click the Add Deployment icon.

On the Add Deployment window, configure the following properties:


Deployment Property	Description
Name	Deployment name.
Description	Optional description.
Agent Type	Type of container orchestration framework where the Provisioning Agent runs. At this time, only Kubernetes is supported.
Provisioning Agent	Name of the Provisioning Agent that manages the deployment.
Number of Data Collector Instances	Number of Data Collector container instances to deploy.
Data Collector Labels	Label or labels to assign to all Data Collector containers provisioned by this deployment. Labels determine the group of Data Collectors that run a job. For more information about labels, see Labels.

In the YAML Specification property, use one of the following methods to replace the sample lines with the deployment YAML specification file that you defined in the previous step:
- Paste the content from your file into the property.
- Click File, select the file you defined, and then click Open to upload the file into the property.
Click Save.

Step 5. Start the Deployment

When you start a deployment, the Provisioning Agent deploys the Data Collector containers to the Kubernetes cluster and starts each Data Collector container.

If you configured the Provisioning Agent for Kerberos authentication, the Provisioning Agent works with Kerberos to dynamically create and inject Kerberos credentials (a service principal and keytab) into each deployed Data Collector container.

The agent deploys each container to a Kubernetes pod. So if the deployment specifies three Data Collector instances, the agent deploys three containers to three Kubernetes pods.

During the startup of each Data Collector container, the Data Collector registers itself with Control Hub.

On the Execute > Deployments view, select the inactive deployment and then click the Start Deployment icon.
It can take the Provisioning Agent up to a minute to provision the Data Collector containers. When complete, Control Hub indicates that the deployment is active.
To verify that the Data Collector containers were successfully registered and are up and running, click Execute > Data Collectors in the Navigation panel.
The Data Collectors view displays all registered Data Collectors - either manually administered or automatically provisioned.