Failover Retries

When a Data Collector job is enabled for failover, Control Hub by default retries the failover indefinitely. To stop failover after a given number of retries, define the maximum number of retries to perform.

To set the maximum number of retries, configure one or both of the following properties when you configure the job:
Failover Retries per Data Collector
    Maximum number of pipeline failover retries to attempt on each available Data Collector. The initial start of a pipeline instance on a Data Collector counts as the first retry attempt.
    Control Hub maintains the failover retry count for each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. This does not affect the retry counts for other Data Collectors running pipeline instances for the same job.
    When this limit is reached for all available Data Collectors, Control Hub does not stop the job. Instead, the job remains in a red active status until another Data Collector becomes available to run the pipeline.
Global Failover Retries
    Maximum number of pipeline failover retries to attempt across all available Data Collectors.
    Control Hub maintains the global failover retry count across all available Data Collectors. When the maximum number of global failover retries is reached, Control Hub stops the job.
    Tip: You can create a subscription that triggers an action when a job has exceeded the maximum number of global failover retries.

Control Hub increments the failover retry count and applies the retry limit only when the pipeline encounters an error and transitions to a Start_Error or Run_Error state. If the engine running the pipeline shuts down, failover always occurs and Control Hub does not increment the failover retry count.
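
The counting rules above can be sketched as a small Python model. This is an illustration only, not Control Hub code or its API: the class `FailoverRetryTracker` and its method names are hypothetical, and the model assumes that every counted failure increments both the per-Data Collector count and the global count. It also does not model how the initial start of a pipeline instance is counted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# States that count toward the retry limits, as named in the text above.
COUNTED_STATES = {"Start_Error", "Run_Error"}


@dataclass
class FailoverRetryTracker:
    """Hypothetical model of the two retry limits; None means unlimited."""

    per_collector_limit: Optional[int] = None
    global_limit: Optional[int] = None
    per_collector_counts: Dict[str, int] = field(default_factory=dict)
    global_count: int = 0

    def record_failure(self, collector: str, state: str) -> None:
        """Count a failure only if the pipeline entered an error state."""
        if state not in COUNTED_STATES:
            # For example, an engine shutdown: failover still occurs,
            # but neither count changes.
            return
        current = self.per_collector_counts.get(collector, 0)
        self.per_collector_counts[collector] = current + 1
        self.global_count += 1  # assumption: global count tracks every counted failure

    def collector_eligible(self, collector: str) -> bool:
        """True while the Data Collector is below its per-collector limit."""
        if self.per_collector_limit is None:
            return True
        return self.per_collector_counts.get(collector, 0) < self.per_collector_limit

    def job_should_stop(self) -> bool:
        """True once the global limit is reached."""
        return self.global_limit is not None and self.global_count >= self.global_limit


tracker = FailoverRetryTracker(per_collector_limit=2, global_limit=3)
tracker.record_failure("C", "Run_Error")
tracker.record_failure("C", "Start_Error")
print(tracker.collector_eligible("C"))  # False: C reached its limit of two
print(tracker.job_should_stop())        # False: global count is two, limit is three
```

In this sketch, reaching the per-Data Collector limit only removes that Data Collector from consideration, while reaching the global limit is the only condition that stops the job.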

Example for Failover Retries per Data Collector

Let's look at an example of how Control Hub maintains the failover retry count when the Failover Retries per Data Collector property is set.

You enable a job for failover, set the number of pipeline instances to two, set the Failover Retries per Data Collector to two, and then start the job on a group of four Data Collectors. The job runs as follows:
  1. Control Hub sends one pipeline instance to Data Collector A and another to Data Collector B.

    Data Collector C and Data Collector D serve as backups.

  2. After some time, the pipeline on Data Collector A fails.
  3. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to one, and then successfully restarts the failed pipeline on Data Collector D.
  4. After additional time, the pipeline on Data Collector B fails.
  5. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to two, and then successfully restarts the failed pipeline on Data Collector A.

    Because Data Collector C has reached the maximum number of failover retries, Control Hub does not attempt to restart additional pipelines for this job on Data Collector C.
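
The steps in this example can be replayed with a short Python simulation. This is a sketch under stated assumptions, not a description of how Control Hub selects Data Collectors: the function `fail_over`, the order in which candidates are tried, and the `failing` set that marks where a restart fails are all hypothetical and chosen to match the steps above.

```python
from collections import Counter

MAX_RETRIES_PER_COLLECTOR = 2  # Failover Retries per Data Collector in the example
retry_counts = Counter()       # failed failover attempts per Data Collector


def fail_over(candidates, failing):
    """Try candidates in order and return the one that starts the pipeline.

    `failing` is the set of Data Collectors where the restart fails
    (an input of this simulation only).
    """
    for collector in candidates:
        if retry_counts[collector] >= MAX_RETRIES_PER_COLLECTOR:
            continue  # limit reached: no further attempts on this Data Collector
        if collector in failing:
            retry_counts[collector] += 1  # failed attempt counts toward the limit
            continue
        return collector
    return None  # no eligible Data Collector could start the pipeline


# Step 3: the pipeline from A fails over; C fails, D succeeds.
step3 = fail_over(["C", "D"], failing={"C"})
# Step 5: the pipeline from B fails over; C fails again, A succeeds.
step5 = fail_over(["C", "A"], failing={"C"})
# A later failover skips C entirely, even if a restart there would now succeed.
later = fail_over(["C", "B"], failing=set())

print(step3, step5, later, retry_counts["C"])  # D A B 2
```

The final call shows the effect of the limit: Data Collector C is skipped without an attempt, so its count stays at two.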