Transformer Communication

StreamSets Control Hub works with Transformer to design pipelines and to execute Transformer pipelines on Apache Spark, an open-source cluster-computing framework.

Control Hub runs on a public cloud service hosted by StreamSets, so you need only an account to get started. You install Transformer on a machine that is configured to submit Spark jobs to a cluster, such as a Hadoop edge or data node or a cloud virtual machine. You then register Transformer to work with Control Hub.

You can install and register multiple instances of Transformer with Control Hub. For example, you might install multiple instances of Transformer to work with different Hadoop YARN clusters. Or you might use one Transformer installation as a test environment and another installation as a production environment.

You can use each registered Transformer for both authoring and execution in Control Hub. You design pipelines in the Control Hub Pipeline Designer after selecting an available authoring Transformer to use. When you run pipelines from Control Hub jobs, the labels assigned to the jobs and to the Transformers determine the execution Transformer that runs the pipeline. Transformer submits the pipeline as a Spark application to the cluster, and then Spark handles all of the pipeline processing.
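The label-based selection described above can be pictured with a short sketch. It is illustrative only, not Control Hub code. It assumes that a Transformer qualifies for a job only when it carries every label assigned to the job. The Transformer class, the eligible_transformers function, and the URLs are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformer:
    """A registered Transformer and its labels (hypothetical model)."""
    url: str
    labels: frozenset


def eligible_transformers(job_labels, transformers):
    """Return the Transformers whose labels include every label on the job.

    Assumption: a Transformer qualifies only when it has all of the job labels.
    """
    wanted = frozenset(job_labels)
    return [t for t in transformers if wanted <= t.labels]


# Placeholder URLs for a test installation and a production installation.
transformers = [
    Transformer("https://transformer-test.example.com", frozenset({"test", "yarn"})),
    Transformer("https://transformer-prod.example.com", frozenset({"prod", "yarn"})),
]

# A job labeled "prod" and "yarn" matches only the production Transformer.
matches = eligible_transformers({"prod", "yarn"}, transformers)
print([t.url for t in matches])
```

In this sketch, a job labeled only "yarn" would match both installations, which is why distinct labels are useful for separating test and production environments.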

Registered Transformers use encrypted REST APIs to communicate with Control Hub. Transformers initiate outbound connections to Control Hub over HTTPS on port number 443.
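Because the connection is outbound, the Transformer machine must be able to reach Control Hub on port 443. The following minimal sketch checks that a TCP connection can be opened from the machine where it runs. The can_connect function is a hypothetical helper, not part of Transformer, and it verifies only TCP reachability, not the TLS handshake or authentication.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if an outbound TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

To use the sketch, call can_connect with the host name from your Control Hub URL. A result of False usually points to a firewall or proxy rule that blocks outbound HTTPS traffic.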

The web browser that accesses Control Hub Pipeline Designer uses encrypted REST APIs to communicate with Control Hub. The web browser initiates outbound connections to Control Hub over HTTPS on port number 443.

The authoring Transformer selected for Pipeline Designer accepts inbound connections from the web browser on the port number configured for the Transformer.

Similarly, the execution Transformer accepts inbound connections from Spark. As Spark processes the pipeline, it sends metrics, last-saved offsets, and pipeline status back to Transformer. The execution Transformer also accepts inbound connections from the web browser when you monitor real-time summary statistics for active jobs.

The following image shows how Transformers communicate with Control Hub:

Transformer Requests

Registered Transformers send requests and information to Control Hub.

Control Hub does not directly send requests to Transformers. Instead, Control Hub sends requests using encrypted REST APIs to a messaging queue managed by Control Hub. Transformers periodically check with the queue to retrieve Control Hub requests.
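The pull model described above can be sketched with an in-memory stand-in for the queue. This is a simplified illustration, not the Control Hub implementation. The MessagingQueue class, its methods, the transformer ID, and the request fields are all hypothetical.

```python
from collections import defaultdict, deque


class MessagingQueue:
    """In-memory stand-in for the queue that Control Hub manages."""

    def __init__(self):
        # Pending requests, keyed by the ID of the receiving Transformer.
        self._pending = defaultdict(deque)

    def publish(self, transformer_id, request):
        """Control Hub side: store a request for one Transformer."""
        self._pending[transformer_id].append(request)

    def retrieve(self, transformer_id):
        """Transformer side: take and return all requests waiting for it."""
        waiting = self._pending.pop(transformer_id, deque())
        return list(waiting)


queue = MessagingQueue()
queue.publish("transformer-1", {"action": "START_PIPELINE", "job": "job-42"})

# Nothing reaches transformer-1 until it checks the queue.
first_poll = queue.retrieve("transformer-1")
second_poll = queue.retrieve("transformer-1")
print(first_poll)   # the stored request
print(second_poll)  # empty, because the request was already retrieved
```

The design choice this illustrates is that Control Hub never needs to open a connection to a Transformer, so no inbound firewall rule from Control Hub to the Transformer machine is required.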

Transformers communicate with Control Hub in the following areas:

Pipeline management

When you use an authoring Transformer to publish a pipeline to Control Hub, the Transformer sends the request to Control Hub.

Connections

When you start a job for a pipeline that uses a connection, the execution Transformer requests the connection properties from Control Hub.

Jobs

Every minute, Transformers send a heartbeat, the last-saved offsets, and the status of all remotely running pipelines to Control Hub so that Control Hub can manage job execution.

Note: Transformer versions 3.12.0 and earlier send this information to the messaging queue.

Security

When you enable Control Hub within a Transformer or when a user logs in to a registered Transformer, the Transformer makes an authentication request to Control Hub.

Metrics

Every minute, an execution Transformer sends aggregated metrics for remotely running pipelines to Control Hub.

Messaging queue

Transformers send the following information to the messaging queue:

At startup, a Transformer sends the following information: Transformer version, URL of the Transformer, and labels configured in the Control Hub configuration file, $TRANSFORMER_CONF/dpm.properties.
When you update permissions on local pipelines, the Transformer sends the updated pipeline permissions.

Every five seconds, Transformers check with the messaging queue to retrieve requests and information sent by Control Hub. When you start, stop, or delete a job, Control Hub sends a pipeline request for specific execution Transformers to the messaging queue. The messaging queue retains each request until the receiving Transformers retrieve it.
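To get a feel for the traffic that these intervals generate, the following sketch counts how often each action fires over a period of time. It uses the five-second queue check and the one-minute heartbeat and metrics reports described in this section. As a simplifying assumption, it treats both timers as starting together at Transformer startup and combines the heartbeat and metrics reports into one action. The function and action names are hypothetical.

```python
POLL_INTERVAL_S = 5     # queue check interval
REPORT_INTERVAL_S = 60  # heartbeat and metrics interval


def due_actions(elapsed_s):
    """Return the actions due at a given whole second since startup."""
    actions = []
    if elapsed_s % POLL_INTERVAL_S == 0:
        actions.append("poll_queue")
    if elapsed_s % REPORT_INTERVAL_S == 0:
        actions.append("send_heartbeat_and_metrics")
    return actions


def simulate(duration_s):
    """Count how often each action fires in the first duration_s seconds."""
    counts = {"poll_queue": 0, "send_heartbeat_and_metrics": 0}
    for second in range(1, duration_s + 1):
        for action in due_actions(second):
            counts[action] += 1
    return counts


# Two minutes of operation: 24 queue checks and 2 reports.
print(simulate(120))
```

Under these assumptions, a job start request waits in the queue for at most about five seconds before the execution Transformer retrieves it, while job status in Control Hub can lag by up to about a minute.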