Transformer Pipeline Failover

You can enable a Transformer job for pipeline failover for some cluster types. Enable pipeline failover to prevent Spark applications from failing due to an unexpected Transformer shutdown.

Important: At this time, failover is supported for jobs that include pipelines configured to run on an Amazon EMR, Databricks, or Google Dataproc cluster. Failover is not supported for other cluster types.

When you start a Transformer job, Control Hub sends an instance of the pipeline to one Transformer that is assigned all labels specified for the job. Transformer remotely runs the pipeline instance on Apache Spark deployed to a cluster. Spark runs the application just as it runs any other application, distributing the processing across nodes in the cluster and automatically handling failover within the cluster as needed.

As the pipeline runs, Spark sends Transformer the status, metrics, and offsets for the running pipeline. Transformer then passes this information to Control Hub. If Transformer unexpectedly shuts down, Spark continues to run the application and attempts to reconnect to Transformer for several minutes. If Spark cannot reconnect to Transformer before Control Hub considers the engine unresponsive, the Spark application fails.

When a job is enabled for failover, Control Hub can reassign the job to a backup Transformer when the initial Transformer becomes unresponsive. In this case, Spark continues to run the application and attempts to reconnect to Transformer for 10 minutes by default, twice the amount of time configured in the execution engine heartbeat interval. If Spark can reconnect to an available backup Transformer during this time, Spark continues running the application and sends all information about the running pipeline to the backup Transformer, resulting in no loss of processing. If Spark cannot reconnect to a backup Transformer during this time, the Spark application fails.
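The 10-minute default follows directly from the heartbeat interval. For illustration only, the following Python sketch (not Control Hub code) shows the relationship; the 5-minute heartbeat value is simply the one implied by the 10-minute default, not a documented setting:

  from datetime import timedelta

  # Reconnect window that Spark uses before the application fails:
  # twice the execution engine heartbeat interval.
  heartbeat_interval = timedelta(minutes=5)   # implied by the 10-minute default
  reconnect_window = 2 * heartbeat_interval

  print(reconnect_window)  # 0:10:00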

An available Transformer is any Transformer that is assigned all labels specified for the job and that has not exceeded any resource thresholds. When multiple Transformers are available as backups, Control Hub prioritizes the Transformers currently running the fewest pipelines.
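For illustration, the following Python sketch (not Control Hub code; all names are hypothetical) captures the selection logic described above: filter out Transformers that lack a job label or have exceeded a resource threshold, then prefer the one running the fewest pipelines:

  from dataclasses import dataclass, field

  @dataclass
  class TransformerEngine:
      url: str
      labels: set = field(default_factory=set)
      running_pipelines: int = 0
      exceeds_thresholds: bool = False   # any CPU, memory, or pipeline limit

  def pick_backup(engines, job_labels):
      """Return the available engine running the fewest pipelines, or None."""
      available = [
          e for e in engines
          if job_labels <= e.labels and not e.exceeds_thresholds
      ]
      return min(available, key=lambda e: e.running_pipelines, default=None)

  engines = [
      TransformerEngine("https://tx-1:19630", {"WesternRegion"}, running_pipelines=3),
      TransformerEngine("https://tx-2:19630", {"WesternRegion"}, running_pipelines=1),
  ]
  print(pick_backup(engines, {"WesternRegion"}).url)   # https://tx-2:19630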

Failover and Backup Transformers

When you enable a Transformer job for failover, define a group of at least two Transformers that the job can start on. That way, you reserve a backup Transformer for pipeline failover. Define a group of Transformers by assigning the same labels to multiple Transformers.

For example, say that you want to run a job on a group of two Transformers assigned the WesternRegion label. You assign the WesternRegion label to the job and enable failover for the job. When you start the job, Control Hub sends a single pipeline instance to the Transformer currently running the fewest pipelines. The second Transformer serves as a backup, available for pipeline failover if the first Transformer unexpectedly shuts down.

Failover Requirements

Note the following requirements to use pipeline failover for a Transformer job:
Transformer version
Use an execution Transformer version 3.17.0 or later when enabling a job for failover on an Amazon EMR or Google Dataproc cluster. Use an execution Transformer version 4.0.0 or later when enabling a job for failover on a Databricks cluster.
Earlier Transformer versions do not support pipeline failover. If you start a job that is enabled for failover when only earlier Transformer versions are available, the job remains in a red active status until a supported Transformer version becomes available.
Supported cluster types
The pipeline included in the job must be configured to run on an Amazon EMR, Databricks, or Google Dataproc cluster. Control Hub will provide failover support for additional cluster types in a future release.
No value defined for the Cluster Callback URL pipeline property
The pipeline must not have a value defined for the Cluster Callback URL property on the pipeline Advanced tab.
When you define the Cluster Callback URL property, you hard-code the Transformer URL for the pipeline to a single Transformer instance, overriding the URL configured in the Transformer configuration file, $TRANSFORMER_CONF/transformer.properties. To support pipeline failover to a backup Transformer, the Spark cluster must be able to communicate with each Transformer instance using the URL configured in the Transformer configuration file.
The Cluster Callback URL is an advanced pipeline property, and does not need to be defined in most cases. For more information, see Cluster Callback URL.
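For illustration, the following Python sketch (not part of any Transformer or Control Hub API) shows why a hard-coded callback URL defeats failover: when the property is set, every callback is pinned to a single Transformer instance, so a backup Transformer can never receive the reconnect attempts:

  def resolve_callback_url(cluster_callback_url, transformer_base_url):
      """Return the URL that the Spark application calls back to."""
      if cluster_callback_url:
          # Hard-coded: callbacks always target this one Transformer instance.
          return cluster_callback_url
      # Empty: each Transformer advertises its own URL from
      # $TRANSFORMER_CONF/transformer.properties, so a backup can take over.
      return transformer_base_url

  print(resolve_callback_url("", "https://tx-2:19630"))                     # failover possible
  print(resolve_callback_url("https://tx-1:19630", "https://tx-2:19630"))   # pinned to tx-1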

Failover Retries

When a Transformer job is enabled for failover, Control Hub retries failover an unlimited number of times by default. To stop failover after a given number of retries, define the maximum number of retries to perform.

To set the maximum number of retries, configure the Global Failover Retries property. Control Hub maintains the global failover retry count across all available Transformers. When the maximum number of global failover retries is reached, Control Hub stops the job.
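For illustration, the following Python sketch (not Control Hub code) shows the retry accounting this implies, including the -1 value used for indefinite retries:

  def should_retry_failover(global_retry_count, global_failover_retries=-1):
      """Return True if another failover attempt should be made."""
      if global_failover_retries == -1:
          return True   # default: retry indefinitely
      return global_retry_count < global_failover_retries

  # With Global Failover Retries set to 3, the fourth attempt is not made
  # and the job is stopped instead.
  for attempt in range(5):
      if not should_retry_failover(attempt, global_failover_retries=3):
          print(f"attempt {attempt + 1}: stop job")
          break
      print(f"attempt {attempt + 1}: fail over to a backup Transformer")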

Tip: You can create a subscription that triggers an action when a job has exceeded the maximum number of global failover retries.

Enabling Pipeline Failover

You can configure a Transformer job for pipeline failover when the pipeline meets the failover requirements. Enable pipeline failover when you create a job or when you edit an inactive job.

Important: Before you enable pipeline failover for a job, first define a group of at least two Transformers that the job can start on. That way, you reserve a backup Transformer for pipeline failover.

To enable pipeline failover when you edit an inactive job:

  1. In the Navigation panel, click Jobs.
  2. Hover over the inactive Transformer job, and click the Edit icon.
  3. Select the Enable Failover property.
  4. Optionally, set the Global Failover Retries property to the maximum number of retries to attempt across all available Transformers.

    When the maximum number of global failover retries is reached, Control Hub stops the job. Use -1 to retry indefinitely.

  5. Click Save.