Provision Data Collectors
Step 1. Create a Custom Image
Use Docker to customize the public StreamSets Data Collector Docker image as needed, and then store the private image in your private repository. A custom image can include the following:
- Customized configuration files
- Resource files
- External libraries, such as JDBC drivers
- Custom stages
- Additional stage libraries - The public Data Collector Docker image includes the basic, development, and Windows stage libraries only.
- Packages and files required to enable Kerberos authentication for Data Collector:
  - On Linux, the krb5-workstation and krb5-client Kerberos client packages.
  - The Hadoop or HDFS configuration files required by the Kerberos-enabled stage, for example:
    - core-site.xml
    - hdfs-site.xml
    - yarn-site.xml
    - mapred-site.xml
Each deployment managed by a Provisioning Agent specifies the Data Collector Docker image to deploy. As a result, you can create a unique Data Collector Docker image for each deployment, or you can use one Docker image for all deployments.
For example, suppose that one deployment of provisioned Data Collectors reads from web server logs, so the Data Collector Docker image used by that deployment requires only the basic and statistics stage libraries. Another deployment of provisioned Data Collectors reads from Google Cloud, so the Data Collector Docker image used by that deployment requires the Google Cloud stage library in addition to the basic and statistics stage libraries. You can create and manage two separate Data Collector Docker images for the deployments, or you can create and manage a single image that meets the needs of both deployments.
For more information about running Data Collector from Docker, see https://hub.docker.com/r/streamsets/datacollector/.
For more information about creating private Docker images and publishing them to a private repository, see the Docker documentation.
Step 2. Create a Provisioning Agent
You can create a Provisioning Agent using one of the following methods:
- Using Helm
  Helm is a tool that streamlines installing and managing Kubernetes applications.
- Without using Helm
  If you do not want to use Helm, you can define a Provisioning Agent YAML specification file, and then use Kubernetes commands to create and deploy the Provisioning Agent.
When you use either method, you can configure the Provisioning Agent to provision Data Collector containers enabled for Kerberos authentication. However, StreamSets recommends using Helm to enable Kerberos authentication.
Creating an Agent Using Helm
To create a Provisioning Agent using Helm, install Helm and download the Control Agent Helm chart that StreamSets provides. After modifying values in the Helm chart, use the Helm install command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
- Install Helm.
  For Helm download and installation instructions, see the Helm project on GitHub.
- After installing Helm, download the Control Agent Helm chart from the StreamSets Helm Charts GitHub repository.
  A Helm chart is a collection of files that describe a related set of Kubernetes resources. After you download the StreamSets Control Agent Helm chart, you'll have a set of files in the following directory, where <chart_directory> is the root directory of the downloaded chart:

      <chart_directory>/control-agent

  For more information about Helm charts, see the Helm documentation.
- Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
  - In the Navigation panel, open the Provisioning Agents view.
  - Click the Generate Authentication Tokens icon.
  - Click Generate.
  - Copy the token from the window and paste it in a local text file so that you can access it when you modify the values YAML file in the next steps.
- Open the <chart_directory>/control-agent/values.yaml file included with the StreamSets Control Agent Helm chart.
- Complete the following steps to modify the file:
  - To enable Kerberos authentication for the provisioned Data Collector containers, create a folder named krb under the <chart_directory>/control-agent folder. Then, copy the Kerberos configuration file, krb5.conf, to this folder.
    The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.
- Run the following Helm command to install the chart:

      helm install streamsets/control-agent

  The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the values YAML file to register itself with Control Hub.
- To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, open the Provisioning Agents view from the Control Hub Navigation panel.
  The Provisioning Agents view displays all registered Provisioning Agents.
Creating an Agent without Using Helm
To create a Provisioning Agent without using Helm, configure a Provisioning Agent YAML specification file, and then use the Kubernetes create command to create and deploy the Provisioning Agent as a containerized application to a Kubernetes pod.
Create one Provisioning Agent for each Kubernetes cluster where you want to provision Data Collectors. For example, if you have a production cluster and a disaster recovery cluster, you would create a total of two Provisioning Agents - one for each cluster.
- Use Kubernetes to create a secret in the Kubernetes namespace where you plan to run the Provisioning Agent.
  On the Kubernetes cluster, run the following command, where <secretName> is the name of the secret and <agentNamespace> is the namespace where you plan to run the Provisioning Agent:

      kubectl create secret generic <secretName> --namespace=<agentNamespace>

  You must create a secret for the Provisioning Agent so that if the Provisioning Agent fails over to another Kubernetes pod, the agent can continue to manage the Data Collector containers that it already deployed. Each Provisioning Agent requires a unique secret.
  For more information about creating Kubernetes secrets, see the Kubernetes documentation.
- Complete the following steps in Control Hub to generate the authentication token for the Provisioning Agent:
  - In the Control Hub Navigation panel, open the Provisioning Agents view.
  - Click the Generate Authentication Tokens icon.
  - Click Generate.
  - Copy the token from the window and paste it in a local text file so that you can access it when you modify the YAML specification in the next steps.
- Copy the following lines of code into a YAML specification file:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: <agentName>
        namespace: <agentNamespace>
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: agent
        template:
          metadata:
            labels:
              app: agent
          spec:
            volumes:
            - name: krb5conf
              secret:
                secretName: krb5conf
            containers:
            - name: <agentName>
              image: streamsets/control-agent:<agentVersion>
              volumeMounts:
              # Mount the Kerberos configuration file
              - name: krb5conf
                mountPath: "/opt/kerberos/krb5.conf"
              env:
              - name: HOST
                valueFrom:
                  fieldRef:
                    fieldPath: status.podIP
              - name: dpm_agent_master_url
                value: <kubernetesMasterUrl>
              - name: dpm_agent_cof_type
                value: "KUBERNETES"
              - name: dpm_agent_dpm_baseurl
                value: <schBaseUrl>
              - name: dpm_agent_component_id
                value: <agentComponentId>
              - name: dpm_agent_token_string
                value: <agentTokenString>
              - name: dpm_agent_name
                value: <agentName>
              - name: dpm_agent_orgId
                value: <schOrgId>
              - name: dpm_agent_kerberos_enabled
                value: "true"
              - name: KRB5_CONFIG
                value: "/opt/kerberos/krb5.conf"
              - name: dpm_agent_kerberos_secret
                value: <kerbsecret>
              - name: dpm_agent_kdc_type
                value: <AD|MIT>
              - name: dpm_agent_secret
                value: <secretName>
- If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the file:

      ...
            volumes:
            - name: krb5conf
              secret:
                secretName: krb5conf
      ...
              volumeMounts:
              # Mount the Kerberos configuration file
              - name: krb5conf
                mountPath: "/opt/kerberos/krb5.conf"
      ...
              - name: dpm_agent_kerberos_enabled
                value: "true"
              - name: KRB5_CONFIG
                value: "/opt/kerberos/krb5.conf"
              - name: dpm_agent_kerberos_secret
                value: <kerbsecret>
              - name: dpm_agent_kdc_type
                value: <AD|MIT>
- If you are enabling Kerberos authentication, create a secret named krb5conf for the Kerberos configuration file, krb5.conf.
  The krb5.conf file contains Kerberos configuration information, including the locations of key distribution centers (KDCs) and admin servers for the Kerberos realms, defaults for the current realm, and mappings of hostnames onto Kerberos realms.
- Replace the following variables in the file with the appropriate attribute values:

  Variable | Description |
  ---|---|
  agentName | Name of the Provisioning Agent. |
  agentVersion | Version of the StreamSets Control Agent Docker image. Use latest or version 5.0.0 or later. For example: image: streamsets/control-agent:latest |
  agentNamespace | Namespace of the Provisioning Agent in Kubernetes. Use the same namespace to create deployments for this Provisioning Agent. |
  kubernetesMasterUrl | Kubernetes master URL for your Kubernetes cluster. |
  schBaseUrl | URL to access Control Hub. Set to the Control Hub URL provided by your system administrator. For example, https://<hostname>:18631. |
  agentComponentId | Unique ID for this Provisioning Agent within Control Hub. For example, use agent_<organizationID> if your organization requires a single Provisioning Agent. Or use agentprod_<organizationID> and agentrecovery_<organizationID> if your organization requires one agent for a production cluster and another agent for a disaster recovery cluster. |
  agentTokenString | Authentication token that you generated for the Provisioning Agent earlier in this procedure. |
  schOrgId | Control Hub organization ID. |
  kerbsecret | Optional. If enabling Kerberos authentication, the secret used for Kerberos authentication that contains the following values: encryption_types; container_dn, if using Active Directory; ldap_url, if using Active Directory; admin_principal; admin_key. |
  AD\|MIT | Optional. If enabling Kerberos authentication, the authentication type for the Kerberos key distribution center: Active Directory or MIT Kerberos. |
  secretName | Secret name that you created for the Provisioning Agent in the Kubernetes namespace in the first step of this procedure. |

- Save the YAML specification with an appropriate file name, for example: schAgent.yml.
- On the Kubernetes cluster, run the following command, where <fileName> is the name of your saved YAML file:

      kubectl create -f <fileName>.yml

  The command creates and deploys the Provisioning Agent as a containerized application to a Kubernetes pod. When the Provisioning Agent starts, it uses the authentication token in the YAML file to register itself with Control Hub.
- To verify that the Provisioning Agent is up and running and successfully registered with Control Hub, open the Provisioning Agents view from the Navigation panel.
  The Provisioning Agents view displays all registered Provisioning Agents.
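The two secrets used in this procedure can also be written declaratively. The following is a minimal sketch and is not taken from the StreamSets documentation. The first manifest is an empty generic secret, which is what the kubectl create secret generic command shown earlier produces. The second stores the Kerberos configuration under the key krb5.conf, which matches the subPath used in the deployment samples in Step 3. The realm EXAMPLE.COM and the host kdc.example.com are hypothetical, so replace the file content with your own krb5.conf.

```yaml
# Empty generic secret for the Provisioning Agent.
apiVersion: v1
kind: Secret
metadata:
  name: <secretName>
  namespace: <agentNamespace>
type: Opaque
---
# Secret holding the Kerberos configuration file.
# Needed only when enabling Kerberos authentication.
apiVersion: v1
kind: Secret
metadata:
  name: krb5conf
  namespace: <agentNamespace>
type: Opaque
stringData:
  # Hypothetical krb5.conf content; paste your real file here.
  krb5.conf: |
    [libdefaults]
      default_realm = EXAMPLE.COM
    [realms]
      EXAMPLE.COM = {
        kdc = kdc.example.com
        admin_server = kdc.example.com
      }
```

If you save the manifests to a file, you can create both secrets with the same kubectl create -f command used for the Provisioning Agent specification.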
Step 3. Define a Deployment YAML Specification
Define a deployment in a YAML specification file. Each file can define a single deployment. The file can optionally define a Kubernetes Horizontal Pod Autoscaler associated with the deployment.
In most cases, you define a single Data Collector container within a deployment specification. To define multiple containers, use the StreamSets Control Agent Docker image version 5.1.0 or later and define the Data Collector container as the first element in the list of containers.
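As an illustration of the ordering rule for multiple containers, the following sketch places the Data Collector container first and a second container after it. The log-forwarder container and the <sidecarImage> placeholder are hypothetical and are included only to show the ordering. The rest mirrors the deployment sample in this step with the Kerberos attributes removed.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
    spec:
      containers:
      # The Data Collector container is the first element in the list.
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      # Hypothetical second container, shown only to illustrate ordering.
      - name: log-forwarder
        image: <sidecarImage>
```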
Use the apps/v1 API version to define each deployment.
The YAML specification file can contain the following components:
- Deployment
  Use for a deployment of one or more Data Collectors that can be manually scaled. To manually scale a deployment, you modify a deployment in the Control Hub UI to increase the number of Data Collector instances.
- Deployment associated with a Kubernetes Horizontal Pod Autoscaler
  Use for a deployment of one or more Data Collectors that must automatically scale during times of peak performance. Define the deployment and Horizontal Pod Autoscaler in the same YAML specification file. The Kubernetes Horizontal Pod Autoscaler automatically scales the deployment based on CPU utilization. For more information, see the Kubernetes Horizontal Pod Autoscaler documentation.
Deployment Sample
Define only a deployment in the YAML specification file when creating a deployment for one or more Data Collectors that can be manually scaled.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        volumeMounts:
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
          readOnly: true
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
      imagePullSecrets:
      - name: <imagePullSecrets>
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf
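If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample:
...
        kerberosEnabled: true
        krbPrincipal: <KerberosUser>
...
        volumeMounts:
        - name: krb5conf
          mountPath: /etc/krb5.conf
          subPath: krb5.conf
          readOnly: true
...
      volumes:
      - name: krb5conf
        secret:
          secretName: krb5conf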
Replace the following variables in the sample with the appropriate values:
Variable | Description |
---|---|
agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. |
deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. |
KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates two Kerberos principals, one for each container. |
privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, set the attribute to streamsets/datacollector:<version>, where <version> is the Data Collector version. |
imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
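As a rough illustration of these substitutions, the following sketch shows the sample with Kerberos authentication disabled and the public image in use, so the Kerberos attributes and the imagePullSecrets lines are removed. The namespace streamsets and the label weblogs are hypothetical values chosen for this example, and <version> is left as a placeholder for your Data Collector version.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: streamsets            # hypothetical value for <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: weblogs                 # hypothetical value for <deploymentLabel>
  template:
    metadata:
      labels:
        app: weblogs               # must match the selector above
    spec:
      containers:
      - name: datacollector
        # Public image; replace <version> with your Data Collector version.
        image: streamsets/datacollector:<version>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
```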
Deployment and Horizontal Pod Autoscaler Sample
Define a deployment and Horizontal Pod Autoscaler in the YAML specification file when creating a deployment for one or more Data Collectors that automatically scale during times of peak performance.
apiVersion: v1
kind: List
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: datacollector-deployment
    namespace: <agentNamespace>
  spec:
    replicas: 1
    selector:
      matchLabels:
        app: <deploymentLabel>
    template:
      metadata:
        labels:
          app: <deploymentLabel>
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
      spec:
        containers:
        - name: datacollector
          image: <privateImage>
          ports:
          - containerPort: 18630
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
          env:
          - name: HOST
            valueFrom:
              fieldRef:
                fieldPath: status.podIP
          - name: PORT0
            value: "18630"
        imagePullSecrets:
        - name: <imagePullSecrets>
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1beta1
      kind: Deployment
      name: <deploymentLabel>
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 50
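If you are not enabling Kerberos authentication, remove the following Kerberos attributes from the sample:
...
          kerberosEnabled: true
          krbPrincipal: <KerberosUser>
...
          volumeMounts:
          - name: krb5conf
            mountPath: /etc/krb5.conf
            subPath: krb5.conf
            readOnly: true
...
        volumes:
        - name: krb5conf
          secret:
            secretName: krb5conf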
Replace the following variables in the sample with the appropriate values:
Variable | Description |
---|---|
agentNamespace | Namespace used for the Provisioning Agent that manages this deployment. |
deploymentLabel | Label for this deployment. Must be unique for all deployments managed by the Provisioning Agent. |
KerberosUser | User for the Kerberos principal when enabling Kerberos authentication. This attribute is optional. If you remove this attribute, the Provisioning Agent uses a default user. The Provisioning Agent creates a unique Kerberos principal for each deployed Data Collector container. For example, if you define the KerberosUser attribute as marketing and the Provisioning Agent deploys two Data Collector containers, the agent creates two Kerberos principals, one for each container. |
privateImage | Path to your private Data Collector Docker image stored in your private repository. Or, if using the public StreamSets Data Collector Docker image, set the attribute to streamsets/datacollector:<version>, where <version> is the Data Collector version. |
imagePullSecrets | Pull secrets required for the private image stored in your private repository. If using the public StreamSets Data Collector Docker image, remove these lines. |
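In the Horizontal Pod Autoscaler definition, the scaleTargetRef attribute also uses the deploymentLabel variable:
      kind: Deployment
      name: <deploymentLabel>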
In the Horizontal Pod Autoscaler definition, you also might want to modify the minimum and maximum replica values and the target CPU utilization percentage value. For more information on these values, see the Kubernetes Horizontal Pod Autoscaler documentation.
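For instance, the Horizontal Pod Autoscaler item from the sample might be adjusted as in the following sketch. The values 2, 6, and 70 are hypothetical and chosen only for illustration. Everything else is copied unchanged from the sample.

```yaml
- apiVersion: autoscaling/v1
  kind: HorizontalPodAutoscaler
  metadata:
    name: datacollector-hpa
    namespace: <agentNamespace>
  spec:
    scaleTargetRef:
      apiVersion: apps/v1beta1
      kind: Deployment
      name: <deploymentLabel>
    # Keep at least two Data Collector pods running at all times.
    minReplicas: 2
    # Never scale beyond six pods.
    maxReplicas: 6
    # Add pods when average CPU utilization across pods exceeds 70%.
    targetCPUUtilizationPercentage: 70
```

Kubernetes measures CPU utilization as a percentage of the CPU that the pods request, so this setting generally assumes that the Data Collector container declares a CPU request in a resources attribute.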
Attributes for AWS Fargate with EKS
When provisioning Data Collectors to AWS Fargate with Amazon Elastic Kubernetes Service (EKS), add the following additional attributes to the deployment YAML specification file:
- Required attribute
  Add the following required environment variable to avoid having to configure the maximum open file limit on the virtual machines provisioned by AWS Fargate:

      - name: SDC_FILE_LIMIT
        value: "0"

- Optional attribute
  Add the following optional resources attribute to define the size of the virtual machines that AWS Fargate provisions. Set the values of the cpu and memory attributes as needed:

      resources:
        limits:
          cpu: 500m
          memory: 2G
        requests:
          cpu: 200m
          memory: 2G

The following sample deployment includes both attributes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector-deployment
  namespace: <agentNamespace>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <deploymentLabel>
  template:
    metadata:
      labels:
        app: <deploymentLabel>
    spec:
      containers:
      - name: datacollector
        image: <privateImage>
        ports:
        - containerPort: 18630
        env:
        - name: HOST
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: PORT0
          value: "18630"
        - name: SDC_FILE_LIMIT
          value: "0"
        resources:
          limits:
            cpu: 500m
            memory: 2G
          requests:
            cpu: 200m
            memory: 2G
Step 4. Create a Deployment
After defining the deployment YAML specification file, use Control Hub to create a deployment.
You can create multiple deployments for a single Provisioning Agent. For example, for the Provisioning Agent running in the production cluster, you might create one deployment dedicated to running jobs that read web server logs and another deployment dedicated to running jobs that read data from Google Cloud.
- In the Navigation panel, open the Deployments view.
- Click the Add Deployment icon.
- On the Add Deployment window, configure the following properties:

  Deployment Property | Description |
  ---|---|
  Name | Deployment name. |
  Description | Optional description. |
  Agent Type | Type of container orchestration framework where the Provisioning Agent runs. At this time, only Kubernetes is supported. |
  Provisioning Agent | Name of the Provisioning Agent that manages the deployment. |
  Number of Data Collector Instances | Number of Data Collector container instances to deploy. |
  Data Collector Labels | Label or labels to assign to all Data Collector containers provisioned by this deployment. Labels determine the group of Data Collectors that run a job. For more information about labels, see Labels. |

- In the YAML Specification property, use one of the following methods to replace the sample lines with the deployment YAML specification file that you defined in the previous step:
  - Paste the content from your file into the property.
  - Click File, select the file you defined, and then click Open to upload the file into the property.
- Click Save.
Step 5. Start the Deployment
When you start a deployment, the Provisioning Agent deploys the Data Collector containers to the Kubernetes cluster and starts each Data Collector container.
If you configured the Provisioning Agent for Kerberos authentication, the Provisioning Agent works with Kerberos to dynamically create and inject Kerberos credentials (a service principal and keytab) into each deployed Data Collector container.
The agent deploys each container to a Kubernetes pod. So if the deployment specifies three Data Collector instances, the agent deploys three containers to three Kubernetes pods.
During the startup of each Data Collector container, the Data Collector registers itself with Control Hub.
- In the Deployments view, select the inactive deployment and then click the Start Deployment icon.
  It can take the Provisioning Agent up to a minute to provision the Data Collector containers. When complete, Control Hub indicates that the deployment is active.
- To verify that the Data Collector containers were successfully registered and are up and running, open the Data Collectors view from the Navigation panel.
  The Data Collectors view displays all registered Data Collectors, whether manually administered or automatically provisioned.