Granting the Spark Cluster Access to Transformer

When Transformer works with a Spark installation that runs on a cluster, the Spark cluster must be able to access Transformer to send the status, metrics, and offsets for running pipelines.

Note: Granting the Spark cluster access to Transformer involves configuring the default Transformer URL. When needed, you can configure a cluster callback URL for a pipeline to override the default URL.

  1. If you use one of the cloud service provider integrations that StreamSets provides, such as an Amazon EC2 or a Google Compute Engine (GCE) deployment, locate the public IP address of the provisioned instance:
    1. Launch the deployment to provision the instance.
    2. Use the console for your cloud service provider to locate the provisioned instance.
    3. Copy the public IP address of the instance.
  2. In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Transformer Configuration.
  3. Uncomment the transformer.base.http.url property and set it to the Transformer URL.

    For example, if you use an EC2 or GCE deployment with the default Transformer port, define the property with the public IP address that you copied from the cloud service provider console, as follows:

    transformer.base.http.url=http://<IP address>:19630

    If you use a self-managed deployment with the default Transformer port on a host machine named myhost, define the property as follows:

    transformer.base.http.url=http://myhost:19630

    Important: If a self-managed Transformer runs on a cloud-computing platform, set the property to the publicly accessible URL of that instance.
  4. Save the changes to the deployment and restart all engine instances.
  5. Grant the Spark cluster access to Transformer at this URL.

    For information about granting the Spark cluster access to other machines, see the documentation for your Spark vendor.
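
After completing the steps above, it can help to confirm from a Spark cluster node that the host and port in the Transformer URL accept connections. The following is a minimal Python sketch of such a check, not part of StreamSets tooling. The function names parse_host_port and can_connect are hypothetical, and the URL is the myhost example from step 3; substitute the value you set for transformer.base.http.url. A successful TCP connection shows only that the port is reachable from that node, not that Transformer itself is working correctly.

```python
import socket
from urllib.parse import urlsplit

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_host_port(url):
    """Return (host, port) extracted from an http(s) URL."""
    parts = urlsplit(url)
    if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {url!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def can_connect(url, timeout=5.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = parse_host_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Assumption: the example value from step 3; replace with your own URL.
    url = "http://myhost:19630"
    status = "reachable" if can_connect(url) else "NOT reachable"
    print(f"{url} is {status} from this node")
```

Run the script on a node of the Spark cluster rather than on the Transformer machine, because the check is meaningful only from the side that must send status, metrics, and offsets back to Transformer.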