Data Collector Pipeline Failover

You can enable a Data Collector job for pipeline failover. Enable pipeline failover to minimize downtime due to unexpected pipeline failures and to help you achieve high availability.

When a job is enabled for failover, Control Hub can restart a failed pipeline on another available execution engine that is assigned all labels specified for the job, starting from the last-saved offset.

Control Hub restarts a failed Data Collector pipeline in the following situations:
  • The pipeline has reached the maximum number of retry attempts after encountering an error and has transitioned to a Start_Error or Run_Error state.
  • The Data Collector running the pipeline has been unresponsive for the maximum engine heartbeat interval.

    A Data Collector can become unresponsive because it shuts down or because it cannot connect to Control Hub due to a network or system outage.

An available Data Collector includes any Data Collector that is assigned all labels specified for the job, that is not currently running a pipeline instance for the job, and that has not exceeded any resource thresholds. When multiple Data Collectors are available, Control Hub prioritizes Data Collectors that have not previously failed the pipeline and Data Collectors that are currently running the fewest number of pipelines.

For example, you enable a job for failover, set the number of pipeline instances to one, and then start the job on a group of three Data Collectors. Control Hub initially sends the pipeline instance to Data Collector A, but the pipeline fails on Data Collector A. At the time of failover, Data Collector A is running no other pipelines, Data Collector B is running one other pipeline, and Data Collector C is running two other pipelines. Control Hub restarts the failed pipeline on Data Collector B. If all three Data Collectors had already failed the pipeline, then Control Hub would restart the failed pipeline on Data Collector A.
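The prioritization described above can be sketched in Python. This is a hypothetical illustration only; the function, field names, and data shapes are assumptions, not Control Hub's actual implementation.

```python
# Hypothetical sketch of failover target selection as described above.
# All names and structures are illustrative assumptions.

def pick_failover_target(engines, job):
    """Return the name of the Data Collector to restart a failed pipeline on,
    or None if no Data Collector is available."""
    available = [
        e for e in engines
        if job["labels"] <= e["labels"]   # assigned all labels specified for the job
        and not e["runs_job"]             # not already running an instance of this job
        and not e["over_threshold"]       # has not exceeded any resource thresholds
    ]
    if not available:
        return None
    # Prefer engines that have not previously failed the pipeline,
    # then engines currently running the fewest pipelines.
    return min(available, key=lambda e: (e["failed_job"], e["running"]))["name"]

# Mirroring the example: Data Collector A previously failed the pipeline, so
# B (one pipeline) is chosen over A (zero pipelines) and C (two pipelines).
engines = [
    {"name": "A", "labels": {"west"}, "running": 0, "runs_job": False,
     "over_threshold": False, "failed_job": True},
    {"name": "B", "labels": {"west"}, "running": 1, "runs_job": False,
     "over_threshold": False, "failed_job": False},
    {"name": "C", "labels": {"west"}, "running": 2, "runs_job": False,
     "over_threshold": False, "failed_job": False},
]
print(pick_failover_target(engines, {"labels": {"west"}}))  # B
```

If every available Data Collector has already failed the pipeline, the same ordering falls back to the one running the fewest pipelines, matching the example.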

Failover and Number of Pipeline Instances

When you enable a Data Collector job for failover, set the number of pipeline instances that the job runs to a value less than the number of available Data Collectors. This reserves Data Collectors for pipeline failover. As a best practice, reserve at least two Data Collectors as backups.

For example, you want to run a job on the group of four Data Collectors assigned the WesternRegion label, and want to reserve two of the Data Collectors for pipeline failover. You assign the WesternRegion label to the job, enable failover for the job, and set the Number of Instances property to two. When you start the job, Control Hub sends pipeline instances to the two Data Collectors currently running the fewest number of pipelines. The third and fourth Data Collectors serve as backups and are available for pipeline failover if another Data Collector shuts down or a pipeline encounters an error.

For more information about configuring the number of pipeline instances for a job, see Number of Pipeline Instances.

Failover Retries

When a Data Collector job is enabled for failover, Control Hub retries the failover an infinite number of times by default. If you want the failover to stop after a given number of retries, define the maximum number of retries to perform.

To determine the maximum number of retries, configure one or both of the following properties when you configure the job:
Failover Retries per Data Collector
Maximum number of pipeline failover retries to attempt on each available Data Collector. The initial start of a pipeline instance on a Data Collector counts as the first retry attempt.
Control Hub maintains the failover retry count for each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. This does not affect the retry counts for other Data Collectors running pipeline instances for the same job.
When this limit is reached for all available Data Collectors, Control Hub does not stop the job. Instead, the job remains in a red active status until another Data Collector becomes available to run the pipeline.
Global Failover Retries
Maximum number of pipeline failover retries to attempt across all available Data Collectors.
Control Hub maintains the global failover retry count across all available Data Collectors. When the maximum number of global failover retries is reached, Control Hub stops the job.
Tip: You can create a subscription that triggers an action when a job has exceeded the maximum number of global failover retries.

Control Hub increments the failover retry count and applies the retry limit only when the pipeline encounters an error and transitions to a Start_Error or Run_Error state. If the engine running the pipeline shuts down, failover always occurs and Control Hub does not increment the failover retry count.
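This accounting rule can be sketched as follows. The function and state names are illustrative assumptions, not the actual Control Hub implementation.

```python
# Hypothetical sketch of the retry-accounting rule described above: pipeline
# errors increment the per-engine failover retry count, while engine shutdowns
# trigger failover without incrementing it.

def record_failure(state, retry_counts, target_engine):
    """Update and return the failover retry count for target_engine after a
    failover attempt triggered by the given pipeline state."""
    if state in ("Start_Error", "Run_Error"):
        # Only error-state failures count toward the retry limit.
        retry_counts[target_engine] = retry_counts.get(target_engine, 0) + 1
    # An engine shutdown always fails over and leaves the count unchanged.
    return retry_counts.get(target_engine, 0)

counts = {}
record_failure("Run_Error", counts, "C")        # count for C becomes 1
record_failure("Engine_Shutdown", counts, "C")  # count for C stays at 1
```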

Example for Failover Retries per Data Collector

Let's look at an example of how Control Hub applies the Failover Retries per Data Collector property.

You enable a job for failover, set the number of pipeline instances to two, set the Failover Retries per Data Collector to two, and then start the job on a group of four Data Collectors. The job runs as follows:
  1. Control Hub sends one pipeline instance to Data Collector A and another to Data Collector B.

    Data Collector C and Data Collector D serve as backups.

  2. After some time, the pipeline on Data Collector A fails.
  3. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to one, and then successfully restarts the failed pipeline on Data Collector D.
  4. After additional time, the pipeline on Data Collector B fails.
  5. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to two, and then successfully restarts the failed pipeline on Data Collector A.

    Since Data Collector C has reached the maximum number of failover attempts, Control Hub does not attempt to restart additional pipelines for this job on Data Collector C.

Determining When to Enable Failover

Consider the pipeline origin when deciding whether to enable failover for a Data Collector job.

Enable failover for a job when the external origin system maintains the offset, such as Kafka, or when a backup Data Collector can continue processing from the last-saved offset recorded by the previous Data Collector. For example, if a pipeline reads from an external system such as a relational database or Elasticsearch, any Data Collector within the same network and with an identical configuration can continue processing from the last-saved offset recorded by another Data Collector.

Disable failover for a job when the pipeline origin is tied to a particular Data Collector machine. In this situation, a backup Data Collector cannot continue processing from the last-saved offset recorded by the previous Data Collector. For example, let's say that a pipeline contains a Directory origin that reads data from a local directory on the Data Collector machine. The pipeline runs on a group of three Data Collectors, each of which has the same local directory that contains a different source file. If one of the Data Collector machines unexpectedly shuts down, then no other Data Collector can read the local file on that machine. As a result, the failed pipeline cannot be restarted on another Data Collector.

Configure Origin Systems for Failover

Enabling Data Collector pipeline failover provides high availability for pipeline processing. It does not provide high availability of the incoming data that a pipeline reads. Pipeline failover can take several minutes to complete. To ensure that no incoming data is lost during the recovery of a failed pipeline, you might need to configure the origin system to support pipeline failover.

For example, if a pipeline contains an origin that listens for requests from a client, such as the HTTP Server origin or the WebSocket Server origin, the client can continue sending requests during the downtime, which can result in lost data. To avoid data loss, configure the origin system in the following ways:

  • Configure clients to resend requests when an error occurs while sending data.
  • Set up load balancing on the origin system to redirect client requests to the remaining running pipelines during a pipeline failover.

Enabling Pipeline Failover

You can configure a Data Collector job for pipeline failover. Enable pipeline failover when you create a job or when you edit an inactive job.

Important: Before you enable pipeline failover for a job, use engine labels to define a group of at least two Data Collectors that the job can start on.

To enable pipeline failover when you edit an inactive job:

  1. In the Navigation panel, click Run > Job Instances.
  2. Hover over the inactive Data Collector job, and click the Edit icon.
  3. Set the Number of Instances property to a value less than the number of available Data Collectors.
    This reserves available Data Collectors as backups for pipeline failover.
  4. Select the Enable Failover property.
  5. Optionally, set one or both of the following failover retries properties:
    Failover Retries per Data Collector
    Maximum number of retries to attempt on each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. Use -1 to retry indefinitely.

    Global Failover Retries
    Maximum number of retries to attempt across all available Data Collectors. When the maximum number of global failover retries is reached, Control Hub stops the job. Use -1 to retry indefinitely.


  6. Click Save.

Balancing Jobs Enabled for Failover

You can balance active Data Collector jobs enabled for pipeline failover. When you balance a job, Control Hub redistributes the pipeline load across available Data Collectors that are running the fewest number of pipelines and that have not exceeded any resource thresholds.

When balancing an active job, Control Hub performs the following actions:
  • Automatically determines if the pipeline load is evenly distributed across available Data Collectors that have not exceeded any resource thresholds.

    If the pipeline load is evenly distributed, Control Hub does not continue with the remaining actions.

    If the pipeline load is not evenly distributed, meaning that an available Data Collector not currently running a pipeline instance for the job is running fewer pipelines than another Data Collector currently running a pipeline instance for the job, then Control Hub continues with the remaining actions.

  • Stops the job so that all running pipeline instances are stopped, and then waits until each Data Collector sends the last-saved offset back to Control Hub. Control Hub maintains the last-saved offsets for all pipeline instances in the job.
  • Redistributes the pipeline load across available Data Collectors that have not exceeded any resource thresholds, sending the last-saved offset for each pipeline instance to a Data Collector.
  • Starts the job, which restarts the pipeline instances from the last-saved offsets so that processing can continue from where the pipelines last stopped.
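The even-distribution check in the first action above can be sketched in Python. This is an illustrative reading of the stated rule; the function and field names are assumptions, not actual Control Hub code.

```python
# Hypothetical sketch of the even-distribution check described above: the load
# is uneven when some available Data Collector not running an instance of the
# job runs fewer pipelines than one that is running an instance of the job.

def needs_balancing(engines):
    """engines: available Data Collectors that have not exceeded resource
    thresholds; each has a pipeline count and a flag for running this job."""
    idle = [e["running"] for e in engines if not e["runs_job"]]
    busy = [e["running"] for e in engines if e["runs_job"]]
    if not idle or not busy:
        return False  # nothing to move, or nowhere to move it
    return min(idle) < max(busy)

# A restarted engine with zero pipelines versus an engine running three
# pipelines, including this job's instance: balancing would help.
print(needs_balancing([
    {"running": 0, "runs_job": False},
    {"running": 1, "runs_job": True},
    {"running": 3, "runs_job": True},
]))  # True
```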

In most cases, you'll balance a job after a pipeline failover occurs. However, you can balance a job enabled for pipeline failover anytime you notice that the pipeline load is not evenly distributed across available Data Collectors.

For example, let’s say that you run a job on a group of four Data Collectors assigned the WesternRegion label. You’ve enabled failover for the job and have set the Number of Instances property to two, reserving two of the Data Collectors for pipeline failover. When you start the job, a pipeline instance runs on Data Collector 1 and Data Collector 2 because they are currently running the fewest number of pipelines.

After a while, Data Collector 1 unexpectedly shuts down, causing the pipeline to fail over to Data Collector 3, which is already running two pipelines for two other jobs. When Data Collector 1 restarts, it does not immediately run any pipelines. However, Data Collector 3 is currently running three pipelines. You balance the job to redistribute the pipeline load. Control Hub automatically determines that Data Collector 1 is available and running the fewest number of pipelines. Control Hub stops the pipeline on Data Collector 3, and restarts the pipeline on Data Collector 1, starting from the last-saved offset.

Balance Specific Jobs

From the Job Instances view, you can balance specific Data Collector jobs enabled for failover. When balancing the jobs, Control Hub redistributes the pipeline load evenly across Data Collectors that have the necessary labels and that have not exceeded any resource thresholds.

  1. In the Navigation panel, click Run > Job Instances.
  2. Select the jobs that you want to balance.
  3. Click the Balance Jobs icon.
  4. At the confirmation prompt, click OK.

Balance Jobs on Specific Data Collectors

From the Execute view, you can balance all the Data Collector jobs enabled for failover and running on specific Data Collectors. When balancing the jobs, Control Hub redistributes the pipeline load evenly across all available Data Collectors that have the necessary labels and that have not exceeded any resource thresholds.

  1. In the Navigation panel, click Set Up > Engines.
  2. Select the Data Collectors running jobs that you want to balance.
  3. Click the Balance icon.
  4. At the confirmation prompt, click Balance.

Comparing Balance Jobs and Synchronize Jobs

Balancing Data Collector jobs differs from synchronizing jobs. The following table lists the key differences:

Action Description
Balance Jobs
  • Balance a job to redistribute the pipeline load for a job enabled for failover.
  • Only jobs enabled for pipeline failover and that are running on a Data Collector can be balanced.
  • When you balance a job, Control Hub performs the following actions:
    • Automatically determines if the pipeline load is evenly distributed across available Data Collectors that have not exceeded any resource thresholds.

      If the pipeline load is evenly distributed, Control Hub does not continue with the remaining actions.

      If the pipeline load is not evenly distributed, meaning that an available Data Collector not currently running a pipeline instance for the job is running fewer pipelines than another Data Collector currently running a pipeline instance for the job, then Control Hub continues with the remaining actions.

    • Stops the job so that all running pipeline instances are stopped, and then waits until each Data Collector sends the last-saved offset back to Control Hub. Control Hub maintains the last-saved offsets for all pipeline instances in the job.
    • Redistributes the pipeline load across available Data Collectors that have not exceeded any resource thresholds, sending the last-saved offset for each pipeline instance to a Data Collector.
    • Starts the job, which restarts the pipeline instances from the last-saved offsets so that processing can continue from where the pipelines last stopped.
Synchronize Jobs
  • Synchronize a job when you've changed the labels assigned to execution engines and the job is actively running on those engines.
  • Any job can be synchronized.
  • When you synchronize a job, Control Hub performs the following actions:
    • Stops the job so that all running pipeline instances are stopped, and then waits until each Data Collector sends the last-saved offset back to Control Hub. Control Hub maintains the last-saved offsets for all pipeline instances in the job.
    • Reassigns the pipeline instances to Data Collectors as follows, sending the last-saved offset for each pipeline instance to a Data Collector:
      • Assigns pipeline instances to additional Data Collectors that match the same labels as the job and that have not exceeded any resource thresholds.
      • Does not assign pipeline instances to Data Collectors that no longer match the same labels as the job.
      • Reassigns pipeline instances on the same Data Collector that matches the same labels as the job and that has not exceeded any resource thresholds. For example, a pipeline might have stopped running after encountering an error or after being deleted from that Data Collector.
    • Starts the job, which restarts the pipeline instances from the last-saved offsets so that processing can continue from where the pipelines last stopped.