Granting the Spark Cluster Access to Transformer

When Transformer works with a Spark installation that runs on a cluster, the Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.

Granting the Spark cluster access to Transformer involves specifying a cluster callback URL that Spark uses to communicate with Transformer.

The steps you complete to grant the Spark cluster access to Transformer depend on the deployment type:

Self-managed deployment
Cloud service provider deployment, including an Amazon EC2, Azure VM, or GCE deployment
Kubernetes deployment

Granting Access for a Self-Managed Deployment

Complete the following steps when the Transformer engine belongs to a self-managed deployment.

In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

transformer.driver.callback.url=http://myhost:19630

For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.
Save the changes to the deployment and restart all engine instances.
Grant the Spark cluster access to Transformer at this URL.

For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
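
Before you start a pipeline, it can help to confirm from a Spark cluster node that the callback URL is reachable. The following is a minimal sketch, not part of the product: the function names parse_callback_url and can_reach are hypothetical, and the check only verifies that a TCP connection to the host and port in the URL succeeds. It does not verify that Transformer itself is answering.

```python
import socket
from urllib.parse import urlsplit

# Default ports used when the callback URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_callback_url(url):
    """Split a cluster callback URL into a (host, port) pair."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"not a usable HTTP(S) URL: {url!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def can_reach(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = parse_callback_url(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, running can_reach("http://myhost:19630") on a cluster node returns False when a firewall or security rule blocks the Transformer port, which points to the access rules that still need to be granted.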

Granting Access for a Cloud Service Provider Deployment

Complete the following steps when the Transformer engine belongs to a cloud service provider deployment, including an Amazon EC2, Azure VM, or GCE deployment.

Locate the public IP address of the provisioned instance:
   1. Launch the deployment to provision the instance.
   2. Use the console for your cloud service provider to locate the provisioned instance.
   3. Copy the public IP address of the instance.
In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

transformer.driver.callback.url=http://myhost:19630

For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.
Save the changes to the deployment and restart all engine instances.
Grant the Spark cluster access to Transformer at this URL.

For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
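
The property line for a cloud deployment can be assembled from the address located in the first step. The following sketch assumes that the copied public IP address is used as the host in the callback URL, which the steps above imply but do not state. The function name callback_property is hypothetical, and 203.0.113.10 is a placeholder address from the range reserved for documentation.

```python
def callback_property(host, port=19630, scheme="http"):
    """Build the transformer.driver.callback.url line for a host and port."""
    if not host or any(ch in host for ch in " /"):
        raise ValueError(f"invalid host: {host!r}")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port!r}")
    return f"transformer.driver.callback.url={scheme}://{host}:{port}"


# Placeholder public IP address; replace it with the address you copied.
print(callback_property("203.0.113.10"))
# prints transformer.driver.callback.url=http://203.0.113.10:19630
```

The default port argument matches the default Transformer port used in the example above. Pass a different port if the engine is configured to listen elsewhere.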

Granting Access for a Kubernetes Deployment

Granting the Spark cluster access to Transformer when using a Kubernetes deployment involves exposing the Transformer container outside the cluster using a Kubernetes service.

You can optionally associate an Ingress with the service. An Ingress can provide load balancing, SSL termination, and name-based virtual hosting to the services in a Kubernetes cluster.

For more information, see the Kubernetes services and Ingress documentation.

In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

transformer.driver.callback.url=http://myhost:19630

For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.
Click Save, and then click Save & Next.

In the Configure Kubernetes Deployment section, click Advanced Mode.

Make the following modifications to the generated YAML:

Add a service definition to expose Transformer as a service in Kubernetes.

For example, if using the default Transformer port, define the service as follows:

apiVersion: v1
kind: Service
metadata:
  name: transformer-service
  namespace: <deploymentNamespace>
spec:
  selector:
    app: <deploymentName>
  ports:
  - name: transformer-port
    protocol: TCP
    port: 19630
    targetPort: 19630
  clusterIP: None

Replace the following variables with the appropriate attribute values:


Variable	Description
deploymentNamespace	Name of the Kubernetes namespace where the Transformer engine is deployed. Use the same namespace name included in the deployment definition in the generated YAML.
deploymentName	Name of the provisioned Kubernetes deployment, using the following format: `streamsets-deployment-<Control_Hub_deployment_ID>`. Use the same deployment name included in the deployment definition in the generated YAML.

Under the containers attribute in the deployment definition, add a containerPort attribute set to the Transformer port.

For example, if using the default Transformer port, define the attribute as follows:
```
ports:
    - containerPort: 19630
```
In the deployment definition, either remove the dnsPolicy attribute or change the attribute value from Default to ClusterFirstWithHostNet.
Optionally, define an Ingress in the YAML.

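The following sample YAML includes the required modifications: the added Service definition, the containerPort attribute under the containers attribute, and the dnsPolicy value.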

apiVersion: v1
kind: Service
metadata:
  name: transformer-service
  namespace: default
spec:
  selector:
    app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
  ports:
  - name: transformer-port
    protocol: TCP
    port: 19630
    targetPort: 19630
  clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
  name: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
  template:
    metadata:
      labels:
        app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
    spec:
      containers:
        - env:
            - name: STREAMSETS_DEPLOYMENT_ID
              value: 7d66fre4-dd0e-4d39-8747-e1aaa31561fd:9143b710-04d2-11ec-b891-41da57d4f127
            - name: STREAMSETS_DEPLOYMENT_TOKEN
              value: eyJ0eXAiOiJKV1QiLCJhbGci5lIn0.eyJzIjoiNjdiZWIzM2NkOTA1ZmCJhbGci5lIMyMWRmNjBmNTVhCJhbGci5lIMTM0NTY4MMDQxYU3OThZGECJhbGci5lI1N2Q0ZjEyNyJ9.
            - name: STREAMSETS_DEPLOYMENT_SCH_URL
              value: https://na01.hub.streamsets.com
          image: streamsets/transformer:scala-2.12_5.3.0
          name: streamsets-deployment-af71641c-ce31-43a5-aca4-18288259c2ff
          ports:
            - containerPort: 19630
          resources:
            requests:
              memory: 1Gi
              cpu: 1
      dnsPolicy: ClusterFirstWithHostNet

When you finish modifying the YAML, click Save & Next.
Save the changes to the deployment and restart all engine instances.
Grant the Spark cluster access to Transformer at this URL.

For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
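
The requirements on the modified Kubernetes YAML can be summarized as a small consistency check. The following sketch is illustrative only: it assumes the Service and Deployment definitions have already been loaded into Python dictionaries (for example, with a YAML parser of your choice), and the function name check_manifests and the deployment name streamsets-deployment-example are hypothetical.

```python
def check_manifests(service, deployment, port=19630):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    template = deployment["spec"]["template"]
    pod_spec = template["spec"]

    # The Service must use the same namespace as the Deployment.
    if service["metadata"].get("namespace") != deployment["metadata"].get("namespace"):
        problems.append("Service and Deployment use different namespaces")

    # Every selector entry must match a label on the pod template.
    pod_labels = template["metadata"].get("labels", {})
    for key, value in service["spec"]["selector"].items():
        if pod_labels.get(key) != value:
            problems.append(f"selector {key}={value} does not match the pod labels")

    # The Service must target the Transformer port.
    target_ports = {p.get("targetPort", p["port"]) for p in service["spec"]["ports"]}
    if port not in target_ports:
        problems.append(f"Service does not target port {port}")

    # A container must declare the Transformer port as a containerPort.
    container_ports = {
        p["containerPort"]
        for c in pod_spec["containers"]
        for p in c.get("ports", [])
    }
    if port not in container_ports:
        problems.append(f"no container declares containerPort {port}")

    # dnsPolicy must be removed or set to ClusterFirstWithHostNet.
    if pod_spec.get("dnsPolicy") not in (None, "ClusterFirstWithHostNet"):
        problems.append("dnsPolicy must be removed or set to ClusterFirstWithHostNet")

    return problems


NAME = "streamsets-deployment-example"  # hypothetical deployment name

service = {
    "metadata": {"name": "transformer-service", "namespace": "streamsets"},
    "spec": {
        "selector": {"app": NAME},
        "ports": [{"name": "transformer-port", "port": 19630, "targetPort": 19630}],
        "clusterIP": "None",
    },
}
deployment = {
    "metadata": {"name": NAME, "namespace": "streamsets"},
    "spec": {
        "template": {
            "metadata": {"labels": {"app": NAME}},
            "spec": {
                "containers": [{"name": NAME, "ports": [{"containerPort": 19630}]}],
                "dnsPolicy": "ClusterFirstWithHostNet",
            },
        }
    },
}

print(check_manifests(service, deployment))  # prints []
```

Each check corresponds to one of the modifications described in this section, so a non-empty result names the step that still needs attention.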