Provision Data Collectors

Step 1. Create a Custom Image

Use Docker to customize the public StreamSets Data Collector Docker image as needed, and then store the private image in your private repository.

Include the following customizations in the private image, based on your requirements:
  • Customized configuration files
  • Resource files
  • External libraries, such as JDBC drivers
  • Custom stages
  • Additional stage libraries - The public Data Collector Docker image includes the basic, development, and Windows stage libraries only.
  • Packages and files required to enable Kerberos authentication for Data Collector:
    • On Linux, the krb5-workstation and krb5-client Kerberos client packages.
    • The Hadoop or HDFS configuration files required by the Kerberos-enabled stage, for example:
      • core-site.xml
      • hdfs-site.xml
      • yarn-site.xml
      • mapred-site.xml
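
For example, a custom image might be built from a Dockerfile similar to the following sketch. The base image tag, file names, and target paths are placeholders; adjust them to match your Data Collector version, configuration directories, and the libraries that you need:

FROM streamsets/datacollector:<version>

# Customized configuration files (hypothetical local file names; copy them to the
# directory that your Data Collector uses for configuration)
COPY sdc.properties credential-stores.properties /etc/sdc/

# Resource files referenced by pipelines (hypothetical)
COPY resources/ /resources/

# External libraries, such as a JDBC driver (hypothetical target directory; use the
# external library directory configured for your Data Collector)
COPY drivers/mysql-connector-java.jar /opt/sdc-extras/streamsets-datacollector-jdbc-lib/lib/

# If enabling Kerberos authentication, also install the Kerberos client packages for
# the image's Linux distribution and copy the required Hadoop configuration files.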

Each deployment managed by a Provisioning Agent specifies the Data Collector Docker image to deploy. So you can create a unique Data Collector Docker image for each deployment, or you can use one Docker image for all deployments.

For example, let's say that one deployment of provisioned Data Collectors reads from web server logs, so the Data Collector Docker image used by that deployment requires only the basic and statistics stage libraries. Another deployment of provisioned Data Collectors reads from the Google Cloud platform, so the Data Collector Docker image used by that deployment requires the Google Cloud stage library in addition to the basic and statistics stage libraries. You can create and manage two separate Data Collector Docker images for the deployments. Or you can create and manage a single image that meets the needs of both deployments.

For more information about running Data Collector from Docker, see https://hub.docker.com/r/streamsets/datacollector/.

For more information about creating private Docker images and publishing them to a private repository, see the Docker documentation.
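
For example, assuming a hypothetical private registry at my-registry.example.com and a Dockerfile like the earlier sketch, you might build and publish the custom image with commands such as:

docker build -t my-registry.example.com/myorg/datacollector-custom:1.0 .
docker login my-registry.example.com
docker push my-registry.example.com/myorg/datacollector-custom:1.0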

Step 2. Create a Provisioning Agent

Use one of the following methods to create a Provisioning Agent:
Using Helm
Helm is a tool that streamlines installing and managing Kubernetes applications. You install Helm, download the Control Agent Helm chart that StreamSets provides, modify the chart values, and then use Helm to create and deploy the Provisioning Agent.
Without using Helm
If you do not want to use Helm, you can define a Provisioning Agent YAML specification file, and then use Kubernetes commands to create and deploy the Provisioning Agent.

When you use either method, you can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication. However, StreamSets recommends using Helm to enable Kerberos authentication.

Creating an Agent Using Helm

To create a Provisioning Agent using Helm, install Helm and download the Control Agent Helm chart that StreamSets provides. After modifying values in the Helm chart, use the Helm install command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.

Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.

  1. Install Helm.

    For Helm download and installation instructions, see the Helm project on GitHub.

  2. After installing Helm, download the Control Agent Helm chart from the StreamSets Helm Charts GitHub repository.
    A Helm chart is a collection of files that describe a related set of Kubernetes resources. After you download the StreamSets Control Agent Helm chart, you'll have a set of files in the following directory, where <chart_directory> is the root directory of the downloaded chart:
    <chart_directory>/control-agent

    For more information about Helm charts, see the Helm documentation.

  3. Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
    1. In the Navigation panel, click Legacy Kubernetes > Provisioning Agents.
    2. Click the Provisioning Agent Components tab.
    3. Click the Generate Authentication Tokens icon.
    4. Click Generate.
    5. Copy the token from the window and paste it in a local text file so that you can access it when you modify the values.yaml file in the following steps.
  4. Open the <chart_directory>/control-agent/values.yaml file included with the StreamSets Control Agent Helm chart.
  5. Complete the following steps to modify the file:
    1. Replace the variables defined for the streamsets attribute.
      For example, the streamsets:orgId attribute is defined as follows:
      orgId: <your org id>
      Replace the variable with your Control Hub organization ID, as follows:
      orgId: MyCompany
    2. To enable Kerberos authentication for the provisioned Data Collector containers, set the krb:enabled attribute to true and then replace all of the variables for the remaining krb attributes.
      If you are not enabling Kerberos authentication, do not make any changes to the krb attributes. A sample of these values.yaml edits appears after this procedure.
  6. To enable Kerberos authentication for the provisioned Data Collector containers, create a folder named krb under the <chart_directory>/control-agent folder. Then, copy the Kerberos configuration file, krb5.conf, to this folder.

    The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.

  7. Run the following Helm command to install the chart:
    helm install streamsets/control-agent

    The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the values YAML file to register itself with Control Hub.

  8. To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, click Legacy Kubernetes > Provisioning Agents in the Control Hub Navigation panel.
    The Provisioning Agents view displays all registered Provisioning Agents.
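
For reference, the values.yaml edits described in steps 4 and 5 might look like the following sketch. Only the orgId and krb.enabled attributes from the procedure are shown; the remaining attribute names depend on your chart version, so the comments only mark where your values go:

streamsets:
  orgId: MyCompany              # your Control Hub organization ID
  # Replace the remaining streamsets variables, including the authentication
  # token that you generated, with values for your organization.
krb:
  enabled: true                 # set to true only when enabling Kerberos authentication
  # When enabled is true, replace the remaining krb variables as well;
  # otherwise leave the krb attributes unchanged.

After the chart is installed, you can also confirm that the agent pod is running with kubectl get pods in the namespace where Helm installed the chart.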

Creating an Agent without Using Helm

To create a Provisioning Agent without using Helm, configure a Provisioning Agent YAML specification file, and then use the Kubernetes create command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.

Note: StreamSets recommends using Helm to configure a Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication.

Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.

  1. Use Kubernetes to create a secret in the Kubernetes namespace where you plan to run the Provisioning Agent.
    On the Kubernetes cluster, run the following command, where <secretName> is the name of the secret and <agentNamespace> is the namespace where you plan to run the Provisioning Agent:
    kubectl create secret generic <secretName> --namespace=<agentNamespace>

    You must create a secret for the Provisioning Agent so that if the Provisioning Agent fails over to another Kubernetes pod, the agent can continue to manage the Data Collector containers that it already deployed. Each Provisioning Agent requires a unique secret.

    For more information about creating Kubernetes secrets, see the Kubernetes documentation.

  2. Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
    1. In the Control Hub Navigation panel, click Legacy Kubernetes > Provisioning Agents.
    2. Click the Generate Authentication Tokens icon.
    3. Click Generate.
    4. Copy the token from the window and paste it in a local text file so that you can access it when you modify the YAML specification in the next step.
  3. Copy the following lines of code into a YAML specification file:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <agentName>
      namespace: <agentNamespace>
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: agent
      template:
        metadata:
          labels:
            app : agent
        spec:
          volumes:
          - name: krb5conf
            secret:
               secretName: krb5conf
          containers:
          - name : <agentName>
            image: streamsets/control-agent:<agentVersion>
            volumeMounts:
            - name: krb5conf
              mountPath: "/opt/kerberos/krb5.conf"
            env:
            - name: HOST
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name : dpm_agent_master_url
              value: <kubernetesMasterUrl>
            - name : dpm_agent_cof_type
              value: "KUBERNETES"
            - name : dpm_agent_dpm_baseurl
              value : <schBaseUrl>
            - name : dpm_agent_component_id
              value : <agentComponentId>
            - name : dpm_agent_token_string
              value : <agentTokenString>
            - name : dpm_agent_name
              value : <agentName>
            - name : dpm_agent_orgId
              value : <schOrgId>
            - name: dpm_agent_kerberos_enabled
              value: "true"
            - name: KRB5_CONFIG
              value: "/opt/kerberos/krb5.conf"
            - name: dpm_agent_kerberos_secret
              value: <kerbsecret>
            - name: dpm_agent_kdc_type
              value: <AD|MIT>
            - name : dpm_agent_secret
              value : <secretName>
  4. If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the file:
    ...
        volumes:
          - name: krb5conf
            secret:
               secretName: krb5conf
    ...
        volumeMounts:
          - name: krb5conf
            mountPath: "/opt/kerberos/krb5.conf"
    ...
          - name: dpm_agent_kerberos_enabled
            value: "true"
          - name: KRB5_CONFIG
            value: "/opt/kerberos/krb5.conf"
          - name: dpm_agent_kerberos_secret
            value: <kerbsecret>
          - name: dpm_agent_kdc_type
            value: <AD|MIT>
  5. If you are enabling Kerberos authentication, create a secret named krb5conf for the Kerberos configuration file, krb5.conf. An example command appears after this procedure.

    The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.

  6. Replace the following variables in the file with the appropriate attribute values:
    Variable - Description
    agentName - Name of the Provisioning Agent.
    agentVersion - Version of the StreamSets Control Agent Docker image. Use latest or version 5.0.0 or later. For example:
      image: streamsets/control-agent:latest
    agentNamespace - Namespace of the Provisioning Agent in Kubernetes. Use the same namespace to create deployments for this Provisioning Agent.
    kubernetesMasterUrl - Kubernetes master URL for your Kubernetes cluster.
    schBaseUrl - URL to access Control Hub. Set to https://<location>.streamsets.com, based on the selected location for your organization.
    agentComponentId - Unique ID for this Provisioning Agent within Control Hub. For example, use agent_<organizationID> if your organization requires a single Provisioning Agent. Or use agentprod_<organizationID> and agentrecovery_<organizationID> if your organization requires one agent for a production cluster and another agent for a disaster recovery cluster.
    agentTokenString - Authentication token that you generated for the Provisioning Agent in step 2.
    schOrgId - Control Hub organization ID.
    kerbsecret - Optional. If enabling Kerberos authentication, the secret used for Kerberos authentication that contains the following values:
      • encryption_types
      • container_dn, if using Active Directory
      • ldap_url, if using Active Directory
      • admin_principal
      • admin_key
    AD|MIT - Optional. If enabling Kerberos authentication, the authentication type for the Kerberos key distribution center: Active Directory or MIT Kerberos.
    secretName - Secret name that you created for the Provisioning Agent in the Kubernetes namespace in step 1.
  7. Save the YAML specification with an appropriate file name, for example: schAgent.yml.
  8. On the Kubernetes cluster, run the following command, where <fileName> is the name of your saved YAML file:
    kubectl create -f <fileName>.yml

    The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the YAML file to register itself with Control Hub.

  9. To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, click Legacy Kubernetes > Provisioning Agents in the Navigation panel.
    The Provisioning Agents view displays all registered Provisioning Agents.
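
For example, assuming that the krb5.conf file is in your current directory, one way to create the krb5conf secret referenced in step 5 is with the kubectl create secret command, using the same namespace as the Provisioning Agent:

kubectl create secret generic krb5conf --from-file=krb5.conf --namespace=<agentNamespace>

You can also confirm that the agent pod is running with kubectl get pods --namespace=<agentNamespace> before checking the Provisioning Agents view in Control Hub.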

Step 3. Define a Deployment YAML Specification

Define a legacy deployment in a YAML specification file. Each file can define a single deployment. The file can optionally define a Kubernetes Horizontal Pod Autoscaler associated with the deployment.

In most cases, you define a single Data Collector container within a deployment specification. To define multiple containers, use the StreamSets Control Agent Docker image version 5.1.0 or later and define the Data Collector container as the first element in the list of containers.

Important: The YAML specification file must use the Kubernetes API version apps/v1 to define each deployment.

The YAML specification file can contain the following components:

Deployment
Use for a legacy deployment of one or more Data Collectors that can be manually scaled. To manually scale a deployment, you modify a deployment in the Control Hub UI to increase the number of Data Collector instances.
For a sample specification file, see Deployment Sample.
Deployment associated with a Kubernetes Horizontal Pod Autoscaler
Use for a legacy deployment of one or more Data Collectors that must automatically scale during times of peak performance. Define the deployment and Horizontal Pod Autoscaler in the same YAML specification file. The Kubernetes Horizontal Pod Autoscaler automatically scales the deployment based on CPU utilization. For more information, see the Kubernetes Horizontal Pod Autoscaler documentation.
For a sample specification file, see Deployment and Horizontal Pod Autoscaler Sample.
When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), you must add additional attributes to the deployment YAML specification file.
Important: For a deployment to be manually or automatically scaled, jobs that run on Data Collector containers in that deployment must be configured to automatically scale out pipeline processing.

Deployment Sample

Define only a deployment in the YAML specification file when creating a deployment for one or more Data Collectors that can be manually scaled.

The following sample YAML specification file defines only a deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app : <deploymentLabel>
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
    spec:
      containers:
      - name : datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        volumeMounts:
         - name: krb5conf
           mountPath: /etc/krb5.conf
           subPath: krb5.conf
           readOnly: true
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      imagePullSecrets:
      - name: <imagePullSecrets>
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample file:
...
      kerberosEnabled: true
      krbPrincipal: <KerberosUser>
...
      volumeMounts:
         - name: krb5conf
           mountPath: /etc/krb5.conf
           subPath: krb5.conf
           readOnly: true
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
Replace the following variables in the sample file with the appropriate attribute values:
Variable - Description
agentNamespace - Namespace used for the Provisioning Agent that manages this deployment.
deploymentLabel - Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent.
KerberosUser - User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses sdc as the Kerberos user.
  The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container using the following format: <KerberosUser>/<host>@<realm>. The agent determines the host and realm to use, creates the Kerberos principal, and generates the keytab for that principal.
  For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates the following Kerberos principals:
    marketing/10.60.1.25@EXAMPLE.COM
    marketing/10.60.1.26@EXAMPLE.COM
privateImage - Path to your private Data Collector Docker image stored in your private repository. If using the public StreamSets Data Collector Docker image, modify the attribute as follows:
    image: streamsets/datacollector:<version>
  Where <version> is the Data Collector version. For example:
    image: streamsets/datacollector:4.1.0
imagePullSecrets - Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines.
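
For reference, after you replace the variables, the label and image attributes in the sample might look like the following excerpt. The dc-logs label and the registry path are hypothetical values; substitute your own deployment label and your private repository path or the public image tag:

  selector:
    matchLabels:
      app: dc-logs
  template:
    metadata:
      labels:
        app: dc-logs
    spec:
      containers:
      - name: datacollector
        image: my-registry.example.com/myorg/datacollector-custom:1.0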

Deployment and Horizontal Pod Autoscaler Sample

Define a deployment and Horizontal Pod Autoscaler in the YAML specification file when creating a deployment for one or more Data Collectors that automatically scale during times of peak performance.

The following sample YAML specification file defines a deployment associated with a Kubernetes Horizontal Pod Autoscaler:
apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app : <deploymentLabel>
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name : datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: <deploymentLabel>
    minReplicas: 1 
    maxReplicas: 10
    targetCPUUtilizationPercentage: 50
If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample file:
...
      kerberosEnabled: true
      krbPrincipal: <KerberosUser>
...
      volumeMounts:
         - name: krb5conf
           mountPath: /etc/krb5.conf
           subPath: krb5.conf
           readOnly: true
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
Replace the following variables in the sample file with the appropriate attribute values:
Variable - Description
agentNamespace - Namespace used for the Provisioning Agent that manages this deployment.
deploymentLabel - Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent.
KerberosUser - User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses sdc as the Kerberos user.
  The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container using the following format: <KerberosUser>/<host>@<realm>. The agent determines the host and realm to use, creates the Kerberos principal, and generates the keytab for that principal.
  For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates the following Kerberos principals:
    marketing/10.60.1.25@EXAMPLE.COM
    marketing/10.60.1.26@EXAMPLE.COM
privateImage - Path to your private Data Collector Docker image stored in your private repository. If using the public StreamSets Data Collector Docker image, modify the attribute as follows:
    image: streamsets/datacollector:<version>
  Where <version> is the Data Collector version. For example:
    image: streamsets/datacollector:4.1.0
imagePullSecrets - Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines.

When a specification file defines a deployment and Horizontal Pod Autoscaler, the Horizontal Pod Autoscaler must be associated to the deployment defined in the same file. In the sample above, the Horizontal Pod Autoscaler is associated to the defined deployment with the following attributes:
kind: Deployment
name: <deploymentLabel>

In the Horizontal Pod Autoscaler definition, you also might want to modify the minimum and maximum replica values and the target CPU utilization percentage value. For more information on these values, see the Kubernetes Horizontal Pod Autoscaler documentation.
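
For example, to keep at least two Data Collector instances running, allow scaling up to 20 instances, and scale out when average CPU utilization exceeds 75 percent, you might set the following values:

    minReplicas: 2
    maxReplicas: 20
    targetCPUUtilizationPercentage: 75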

Attributes for AWS Fargate with EKS

When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), add the following additional attributes to the deployment YAML specification file:

Required attribute
Add the following required environment variable to avoid having to configure the maximum open file limit on the virtual machines provisioned by AWS Fargate:
- name: SDC_FILE_LIMIT
  value: "0"
Optional attribute
Add the following optional resources attribute to define the size of the virtual machines that AWS Fargate provisions. Set the values of the cpu and memory attributes as needed:
resources:
  limits:
    cpu: 500m
    memory: 2G
  requests:
    cpu: 200m
    memory: 2G
For example, to define only a deployment on AWS Fargate with EKS, use the following sample YAML specification file. The additional attributes required for AWS Fargate are the SDC_FILE_LIMIT environment variable and the resources attribute at the end of the container definition:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app : <deploymentLabel>
    spec:
      containers:
      - name : datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
        - name: SDC_FILE_LIMIT
          value: "0"
        resources:
           limits:
             cpu: 500m
             memory: 2G
           requests:
             cpu: 200m
             memory: 2G

Step 4. Create a Legacy Deployment

After defining the deployment YAML specification file, use Control Hub to create a legacy deployment.

You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.

  1. In the Navigation panel, click Legacy Kubernetes > Legacy Deployments.
  2. Click the Add Deployment icon.
  3. On the Add Deployment window, configure the following properties:
    Deployment Property - Description
    Name - Deployment name.
    Description - Optional description.
    Agent Type - Type of container orchestration framework where the Provisioning Agent runs. At this time, only Kubernetes is supported.
    Provisioning Agent - Name of the Provisioning Agent that manages the deployment.
    Number of Data Collector Instances - Number of Data Collector container instances to deploy.
    Data Collector Labels - Label or labels to assign to all Data Collector containers provisioned by this deployment. Labels determine the group of Data Collectors that run a job. For more information about labels, see Labels.

  4. In the YAML Specification property, use one of the following methods to replace the sample lines with the deployment YAML specification file that you defined in the previous step:
    • Paste the content from your file into the property.
    • Click File, select the file you defined, and then click Open to upload the file into the property.
  5. Click Save.

Step 5. Start the Legacy Deployment

When you start a legacy deployment, the Provisioning Agent deploys the Data Collector containers to the Kubernetes cluster and starts each Data Collector container.

If you configured the Provisioning Agent for Kerberos authentication, the Provisioning Agent works with Kerberos to dynamically create and inject Kerberos credentials (a service principal and keytab) into each deployed Data Collector container.

The agent deploys each container to a Kubernetes pod. So if the deployment specifies three Data Collector instances, the agent deploys three containers to three Kubernetes pods.
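
If you have kubectl access to the cluster, you can also watch the provisioned pods directly by listing them with the deployment label defined in the deployment YAML specification, for example:

kubectl get pods --namespace=<agentNamespace> --selector=app=<deploymentLabel>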

During the startup of each Data Collector container, the Data Collector registers itself with Control Hub.

  1. On the Legacy Kubernetes > Legacy Deployments view, select the inactive deployment and then click the Start Deployment icon.
    It can take the Provisioning Agent up to a minute to provision the Data Collector containers. When complete, Control Hub indicates that the deployment is active.
  2. To verify that the Data Collector containers were successfully registered and are up and running, click Set Up > Engines in the Navigation panel.