Granting the Spark Cluster Access to Transformer

When Transformer works with a Spark installation that runs on a cluster, the Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.

Granting the Spark cluster access to Transformer involves specifying a cluster callback URL that Spark uses to communicate with Transformer.

The steps you complete to grant the Spark cluster access to Transformer depend on the deployment type: self-managed, cloud service provider, or Kubernetes.

Granting Access for a Self-Managed Deployment

Complete the following steps when the Transformer engine belongs to a self-managed deployment.

  1. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
  2. To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

    For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

    transformer.driver.callback.url=http://myhost:19630

    For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.

  3. Save the changes to the deployment and restart all engine instances.
  4. Grant the Spark cluster access to Transformer at this URL.

    For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
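
After completing the steps above, it can help to confirm from a Spark cluster node that the callback URL is reachable. The following is a minimal sketch, not part of Transformer: the helper names callback_endpoint and can_reach are hypothetical, and the check only verifies that a TCP connection to the host and port in the URL succeeds. It does not verify that Transformer itself answers on that port.

```python
import socket
from urllib.parse import urlsplit


def callback_endpoint(callback_url):
    """Return (host, port) parsed from a URL such as http://myhost:19630."""
    parts = urlsplit(callback_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {callback_url!r}")
    # Fall back to the scheme's well-known port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_reach(callback_url, timeout=3.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = callback_endpoint(callback_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, calling can_reach("http://myhost:19630") on a Spark node returns False if a firewall or security rule blocks the port, which suggests that step 4 is not yet complete.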

Granting Access for Cloud Service Provider Deployments

Complete the following steps when the Transformer engine belongs to a cloud service provider deployment, including an Amazon EC2, Azure VM, or GCE deployment.

  1. Locate the public IP address of the provisioned instance.
    1. Launch the deployment to provision the instance.
    2. Use the console for your cloud service provider to locate the provisioned instance.
    3. Copy the public IP address of the instance.
  2. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
  3. To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

    For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

    transformer.driver.callback.url=http://myhost:19630

    For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.

  4. Save the changes to the deployment and restart all engine instances.
  5. Grant the Spark cluster access to Transformer at this URL.

    For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
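
The property line in step 3 can also be derived from the address copied in step 1. The following is a minimal sketch that assumes the callback URL should point at the public IP address of the provisioned instance. The helper name callback_property is hypothetical, and 203.0.113.10 is a documentation-only placeholder address.

```python
import ipaddress

DEFAULT_TRANSFORMER_PORT = 19630  # default port used in the examples above


def callback_property(public_ip, port=DEFAULT_TRANSFORMER_PORT, scheme="http"):
    """Format the transformer.driver.callback.url line for an instance IP."""
    ip = ipaddress.ip_address(public_ip)  # raises ValueError if malformed
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    # IPv6 literals must be enclosed in brackets inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"transformer.driver.callback.url={scheme}://{host}:{port}"


print(callback_property("203.0.113.10"))
```

Validating the address before pasting it into the configuration catches copy errors, such as a truncated octet, before the engine instances are restarted.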

Granting Access for a Kubernetes Deployment

Granting the Spark cluster access to Transformer when using a Kubernetes deployment involves exposing the Transformer container outside the cluster using a Kubernetes service.

You can also optionally associate an Ingress with the service. An Ingress can provide load balancing, SSL termination, and name-based virtual hosting to the services in a Kubernetes cluster.

For more information, see the Kubernetes services and Ingress documentation.

  1. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
  2. To specify a cluster callback URL, uncomment the transformer.driver.callback.url property and set it to the Transformer URL.

    For example, if using the default Transformer port on a host machine named myhost, define the property as follows:

    transformer.driver.callback.url=http://myhost:19630

    For more information about specifying a cluster callback URL or overriding the URL in individual pipelines, see Understanding the Spark Cluster Callback URL.

  3. Click Save, and then click Save & Next.
  4. In the Configure Kubernetes Deployment section, click Advanced Mode.

    Make the following modifications to the generated YAML:

    1. Add a service definition to expose Transformer as a service in Kubernetes.

      For example, if using the default Transformer port, define the service as follows:

      apiVersion: v1
      kind: Service
      metadata:
        name: transformer-service
        namespace: <deploymentNamespace>
      spec:
        selector:
          app: <deploymentName>
        ports:
        - name: transformer-port
          protocol: TCP
          port: 19630
          targetPort: 19630
        clusterIP: None

      Replace the following variables with the appropriate attribute values:

      Variable              Description
      deploymentNamespace   Name of the Kubernetes namespace where the Transformer engine is deployed. Use the same namespace name included in the deployment definition in the generated YAML.
      deploymentName        Name of the provisioned Kubernetes deployment, in the format streamsets-deployment-<Control_Hub_deployment_ID>. Use the same deployment name included in the deployment definition in the generated YAML.

    2. Under the containers attribute in the deployment definition, add a containerPort attribute set to the Transformer port.

      For example, if using the default Transformer port, define the attribute as follows:

      ports:
          - containerPort: 19630
    3. In the deployment definition, either remove the dnsPolicy attribute or change the attribute value from Default to ClusterFirstWithHostNet.
    4. Optionally, define an Ingress in the YAML.
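
    The following sample YAML includes the required modifications: the added Service definition, the ports attribute with containerPort under containers, and the dnsPolicy value in the deployment definition.
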
    apiVersion: v1
    kind: Service
    metadata:
      name: transformer-service
      namespace: default
    spec:
      selector:
        app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
      ports:
      - name: transformer-port
        protocol: TCP
        port: 19630
        targetPort: 19630
      clusterIP: None
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
      name: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
      namespace: default
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
      template:
        metadata:
          labels:
            app: streamsets-deployment-af79941c-ce31-42a5-aca4-16289259c2ff
        spec:
          containers:
            - env:
                - name: STREAMSETS_DEPLOYMENT_ID
                  value: 7d66fre4-dd0e-4d39-8747-e1aaa31561fd:9143b710-04d2-11ec-b891-41da57d4f127
                - name: STREAMSETS_DEPLOYMENT_TOKEN
                  value: eyJ0eXAiOiJKV1QiLCJhbGci5lIn0.eyJzIjoiNjdiZWIzM2NkOTA1ZmCJhbGci5lIMyMWRmNjBmNTVhCJhbGci5lIMTM0NTY4MMDQxYU3OThZGECJhbGci5lI1N2Q0ZjEyNyJ9.
                - name: STREAMSETS_DEPLOYMENT_SCH_URL
                  value: https://na01.hub.streamsets.com
              image: streamsets/transformer:scala-2.12_5.3.0
              name: streamsets-deployment-af71641c-ce31-43a5-aca4-18288259c2ff
              ports:
                - containerPort: 19630
              resources:
                requests:
                  memory: 1Gi
                  cpu: 1
          dnsPolicy: ClusterFirstWithHostNet
  5. When you finish modifying the YAML, click Save & Next.
  6. Save the changes to the deployment and restart all engine instances.
  7. Grant the Spark cluster access to Transformer at this URL.

    For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
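
The YAML modifications in step 4 must agree with each other: the service and the deployment use the same namespace, the service selector matches the pod labels, and the service targetPort matches a containerPort. The following is a minimal sketch of those checks, assuming the manifests have already been loaded as Python dictionaries. The helper names transformer_service and find_mismatches are hypothetical and are not part of Control Hub or Kubernetes.

```python
import json


def transformer_service(deployment_name, namespace, port=19630):
    """Build a headless Service manifest like the sample definition above."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "transformer-service", "namespace": namespace},
        "spec": {
            "selector": {"app": deployment_name},
            "ports": [{"name": "transformer-port", "protocol": "TCP",
                       "port": port, "targetPort": port}],
            "clusterIP": "None",
        },
    }


def find_mismatches(service, deployment):
    """Return a list of inconsistencies between a Service and a Deployment."""
    problems = []
    svc_ns = service["metadata"].get("namespace")
    dep_ns = deployment["metadata"].get("namespace")
    if svc_ns != dep_ns:
        problems.append(f"namespace: service={svc_ns!r}, deployment={dep_ns!r}")
    template = deployment["spec"]["template"]
    pod_labels = template["metadata"].get("labels", {})
    for key, value in service["spec"]["selector"].items():
        if pod_labels.get(key) != value:
            problems.append(f"selector {key}={value!r} does not match pod labels")
    # Collect every containerPort declared by any container in the pod.
    container_ports = {
        p["containerPort"]
        for c in template["spec"]["containers"]
        for p in c.get("ports", [])
    }
    for p in service["spec"]["ports"]:
        if p["targetPort"] not in container_ports:
            problems.append(f"targetPort {p['targetPort']} has no containerPort")
    return problems
```

An empty list from find_mismatches means that the three conditions hold. Calling json.dumps(transformer_service(name, namespace), indent=2) prints the generated service definition for review.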