Data Collector Pipeline Failover
You can enable a Data Collector job for pipeline failover. Enable pipeline failover to minimize downtime due to unexpected pipeline failures and to help you achieve high availability.
When a job is enabled for failover, Control Hub can restart a failed pipeline on another available execution engine that is assigned all labels specified for the job, starting from the last-saved offset. Failover occurs when one of the following happens:
- The pipeline has reached the maximum number of retry attempts after encountering an error and has transitioned to a Start_Error or Run_Error state.
- The Data Collector running the pipeline has been unresponsive for the maximum engine heartbeat interval.
A Data Collector can become unresponsive because it shuts down or because it cannot connect to Control Hub due to a network or system outage.
An available Data Collector is any Data Collector that is assigned all labels specified for the job, is not currently running a pipeline instance for the job, and has not exceeded any resource thresholds. When multiple Data Collectors are available, Control Hub prioritizes Data Collectors that have not previously failed the pipeline and Data Collectors that are currently running the fewest number of pipelines.
For example, you enable a job for failover, set the number of pipeline instances to one, and then start the job on a group of three Data Collectors. Control Hub initially sends the pipeline instance to Data Collector A, but the pipeline fails on Data Collector A. At the time of failover, Data Collector A is running no other pipelines, Data Collector B is running one other pipeline, and Data Collector C is running two other pipelines. Control Hub restarts the failed pipeline on Data Collector B. If all three Data Collectors had already failed the pipeline, then Control Hub would restart the failed pipeline on Data Collector A.
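The selection rules above can be sketched in a few lines of Python. The following is an illustrative model only, using hypothetical names rather than Control Hub code: it keeps engines that carry all job labels, are not already running an instance of the job, and are under their resource thresholds, then prefers engines that have not previously failed the pipeline and that run the fewest pipelines.

```python
from dataclasses import dataclass

@dataclass
class Engine:
    """Hypothetical model of a registered Data Collector, for illustration only."""
    name: str
    labels: set
    running_pipelines: int
    over_threshold: bool = False        # exceeded a CPU, memory, or pipeline threshold
    runs_job_instance: bool = False     # already running a pipeline instance for this job
    failed_this_pipeline: bool = False  # previously failed this job's pipeline

def pick_failover_engine(engines, job_labels):
    """Return the engine that the rules above would favor, or None if no engine is available."""
    available = [
        e for e in engines
        if job_labels <= e.labels       # assigned all labels specified for the job
        and not e.runs_job_instance     # not currently running a pipeline instance for the job
        and not e.over_threshold        # has not exceeded any resource thresholds
    ]
    if not available:
        return None
    # Prefer engines that have not failed the pipeline, then the fewest running pipelines.
    return min(available, key=lambda e: (e.failed_this_pipeline, e.running_pipelines))

# The example above: A failed the pipeline and runs no other pipelines, B runs one, C runs two.
engines = [
    Engine("A", {"WesternRegion"}, running_pipelines=0, failed_this_pipeline=True),
    Engine("B", {"WesternRegion"}, running_pipelines=1),
    Engine("C", {"WesternRegion"}, running_pipelines=2),
]
print(pick_failover_engine(engines, {"WesternRegion"}).name)  # B
```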
Failover and Number of Pipeline Instances
When you enable a Data Collector job for failover, set the number of pipeline instances that the job runs to a value less than the number of available Data Collectors. This reserves Data Collectors for pipeline failover. As a best practice, reserve at least two Data Collectors as backups.
For example, you want to run a job on a group of four Data Collectors assigned the WesternRegion label, and you want to reserve two of those Data Collectors for pipeline failover. You assign the WesternRegion label to the job, enable failover for the job, and set the Number of Instances property to two. When you start the job, Control Hub sends pipeline instances to the two Data Collectors currently running the fewest number of pipelines. The third and fourth Data Collectors serve as backups and are available for pipeline failover if another Data Collector shuts down or a pipeline encounters an error.
For more information about configuring the number of pipeline instances for a job, see Number of Pipeline Instances.
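As a quick sanity check, the sizing guidance in this section reduces to simple arithmetic. The helper below is hypothetical, not part of Control Hub or Data Collector; it just confirms that the configured number of instances leaves engines in reserve, ideally at least two.

```python
def check_failover_sizing(num_instances: int, num_engines: int, min_backups: int = 2) -> str:
    """Report how many Data Collectors remain as failover backups (hypothetical helper)."""
    backups = num_engines - num_instances
    if backups <= 0:
        return "Invalid: set the number of instances to less than the number of available Data Collectors."
    if backups < min_backups:
        return f"OK, but only {backups} backup; best practice is to reserve at least {min_backups}."
    return f"OK: {backups} Data Collectors reserved for pipeline failover."

# The WesternRegion example: four Data Collectors, Number of Instances set to two.
print(check_failover_sizing(num_instances=2, num_engines=4))
```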
Failover Retries
When a Data Collector job is enabled for failover, Control Hub retries the failover an infinite number of times by default. If you want failover to stop after a given number of retries, define the maximum number of retries to perform using the following properties:
- Failover Retries per Data Collector: Maximum number of pipeline failover retries to attempt on each available Data Collector. The initial start of a pipeline instance on a Data Collector counts as the first retry attempt.
- Global Failover Retries: Maximum number of pipeline failover retries to attempt across all available Data Collectors.
Control Hub increments the failover retry count and applies the retry limit only when the pipeline encounters an error and transitions to a Start_Error or Run_Error state. If the engine running the pipeline shuts down, failover always occurs and Control Hub does not increment the failover retry count.
Example for Failover Retries per Data Collector
Let's look at an example of how Control Hub applies the Failover Retries per Data Collector property. In this example, the property is set to two.
- Control Hub sends one pipeline instance to Data Collector A and another to Data Collector B. Data Collector C and Data Collector D serve as backups.
- After some time, the pipeline on Data Collector A fails.
- Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to one, and then successfully restarts the failed pipeline on Data Collector D.
- After additional time, the pipeline on Data Collector B fails.
- Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to two, and then successfully restarts the failed pipeline on Data Collector A.
Since Data Collector C has reached the maximum number of failover attempts, Control Hub does not attempt to restart additional pipelines for this job on Data Collector C.
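The retry accounting in this example can be modeled as follows. This is a minimal sketch with hypothetical names, not Control Hub code; it assumes a Failover Retries per Data Collector value of two, counts only failures that end in a Start_Error or Run_Error state, and ignores engine shutdowns, matching the rule described earlier.

```python
from collections import defaultdict

class FailoverRetryTracker:
    """Illustrative model of per-engine and global failover retry limits (-1 means unlimited)."""

    def __init__(self, retries_per_engine=-1, global_retries=-1):
        self.retries_per_engine = retries_per_engine
        self.global_retries = global_retries
        self.per_engine = defaultdict(int)
        self.global_count = 0

    def record_failed_attempt(self, engine, state):
        """Count a failed attempt only when the pipeline reached Start_Error or Run_Error."""
        if state in ("Start_Error", "Run_Error"):
            self.per_engine[engine] += 1
            self.global_count += 1

    def engine_allowed(self, engine):
        return self.retries_per_engine == -1 or self.per_engine[engine] < self.retries_per_engine

    def job_allowed(self):
        return self.global_retries == -1 or self.global_count < self.global_retries

tracker = FailoverRetryTracker(retries_per_engine=2)
tracker.record_failed_attempt("C", "Run_Error")  # first failed attempt on Data Collector C
tracker.record_failed_attempt("C", "Run_Error")  # second failed attempt on Data Collector C
print(tracker.engine_allowed("C"))               # False: C is no longer used for this job
tracker.record_failed_attempt("A", "shutdown")   # engine shutdowns do not count toward retries
print(tracker.engine_allowed("A"))               # True
```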
Determining When to Enable Failover
Consider the pipeline origin when deciding whether to enable failover for a Data Collector job.
Enable failover for a job when the external origin system maintains the offset, such as Kafka, or when a backup Data Collector can continue processing from the last-saved offset recorded by the previous Data Collector. For example, if a pipeline reads from an external system such as a relational database or Elasticsearch, any Data Collector within the same network and with an identical configuration can continue processing from the last-saved offset recorded by another Data Collector.
Disable failover for a job when the pipeline origin is tied to a particular Data Collector machine. In this situation, a backup Data Collector cannot continue processing from the last-saved offset recorded by the previous Data Collector. For example, let's say that a pipeline contains a Directory origin that reads data from a local directory on the Data Collector machine. The pipeline runs on a group of three Data Collectors, each of which has the same local directory containing a different source file. If one of the Data Collector machines unexpectedly shuts down, then no other Data Collector can read the local file on that machine. As a result, the failed pipeline cannot be restarted on another Data Collector.
Configure Origin Systems for Failover
Enabling Data Collector pipeline failover provides high availability for pipeline processing. It does not provide high availability of the incoming data that a pipeline reads. Pipeline failover can take several minutes to complete. To ensure that no incoming data is lost during the recovery of a failed pipeline, you might need to configure the origin system to support pipeline failover.
For example, if a pipeline contains an origin that listens for requests from a client, such as the HTTP Server origin or the WebSocket Server origin, the client can continue sending requests during the downtime, which can result in lost data. To avoid data loss, configure the origin system in the following ways:
- Configure clients to resend requests when an error occurs while sending data.
- Set up load balancing on the origin system to redirect client requests to the remaining running pipelines during a pipeline failover.
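For the first approach, a simple client-side retry loop is often sufficient. The sketch below uses the Python requests library; the endpoint URL, payload, and backoff values are placeholders, and the logic is only one way to resend requests while a pipeline fails over.

```python
import time
import requests

def send_with_retries(url, record, attempts=5, backoff_seconds=2):
    """Resend a record to an HTTP Server origin until it is accepted or attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.post(url, json=record, timeout=10)
            if response.status_code < 500:
                return response                # accepted, or a client error worth surfacing
        except requests.exceptions.RequestException:
            pass                               # connection refused or timed out, e.g. during failover
        time.sleep(backoff_seconds * attempt)  # give the pipeline time to fail over
    raise RuntimeError(f"Record not delivered after {attempts} attempts")

# Placeholder endpoint for an HTTP Server origin; adjust to your deployment.
# send_with_retries("http://sdc.example.com:8000", {"id": 1, "value": "sample"})
```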
Enabling Pipeline Failover
You can configure a Data Collector job for pipeline failover. Enable pipeline failover when you create a job or when you edit an inactive job.
To enable pipeline failover when you edit an inactive job:
- In the Navigation panel, click .
- Hover over the inactive Data Collector job, and click the Edit icon: .
- Set the Number of Instances property to a value less than the number of available Data Collectors. This reserves available Data Collectors as backups for pipeline failover.
- Select the Enable Failover property.
- Optionally, set one or both of the following failover retries properties:
Failover Retries Property | Description |
---|---|
Failover Retries per Data Collector | Maximum number of retries to attempt on each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. Use -1 to retry indefinitely. |
Global Failover Retries | Maximum number of retries to attempt across all available Data Collectors. When the maximum number of global failover retries is reached, Control Hub stops the job. Use -1 to retry indefinitely. |
- Click Save.
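If you track these settings outside the UI, the same properties can be represented as plain data. The snippet below is only an illustrative model in Python, not the Control Hub API or the StreamSets SDK; it mirrors the properties configured in the steps above, with -1 meaning retry indefinitely.

```python
from dataclasses import dataclass

@dataclass
class JobFailoverSettings:
    """Illustrative model of the failover-related job properties described above."""
    number_of_instances: int = 1
    enable_failover: bool = False
    failover_retries_per_data_collector: int = -1  # -1 retries indefinitely
    global_failover_retries: int = -1              # -1 retries indefinitely

# For example: two instances, failover enabled, up to two retries per Data Collector.
settings = JobFailoverSettings(
    number_of_instances=2,
    enable_failover=True,
    failover_retries_per_data_collector=2,
    global_failover_retries=-1,
)
print(settings)
```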
Balancing Jobs Enabled for Failover
You can balance active Data Collector jobs enabled for pipeline failover. When you balance a job, Control Hub redistributes the pipeline load across available Data Collectors that are running the fewest number of pipelines and that have not exceeded any resource thresholds. To balance a job, Control Hub completes the following actions:
- Automatically determines whether the pipeline load is evenly distributed across available Data Collectors that have not exceeded any resource thresholds.
If the pipeline load is evenly distributed, Control Hub does not continue with the remaining actions.
If the pipeline load is not evenly distributed, meaning that an available Data Collector not currently running a pipeline instance for the job is running fewer pipelines than another Data Collector currently running a pipeline instance for the job, then Control Hub continues with the remaining actions.
- Stops the job so that all running pipeline instances are stopped, and then waits until each Data Collector sends the last-saved offset back to Control Hub. Control Hub maintains the last-saved offsets for all pipeline instances in the job.
- Redistributes the pipeline load across available Data Collectors that have not exceeded any resource thresholds, sending the last-saved offset for each pipeline instance to a Data Collector.
- Starts the job, which restarts the pipeline instances from the last-saved offsets so that processing can continue from where the pipelines last stopped.
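The check in the first action can be stated precisely: the load is uneven whenever some available Data Collector that is not running an instance of the job runs fewer pipelines than some Data Collector that is. A minimal sketch, with hypothetical names:

```python
def load_is_even(with_instance_counts, without_instance_counts):
    """Each argument lists the total running-pipeline count per available Data Collector,
    split by whether the engine currently runs a pipeline instance for the job."""
    if not with_instance_counts or not without_instance_counts:
        return True
    # Uneven if an engine without a job instance runs fewer pipelines than one with an instance.
    return min(without_instance_counts) >= max(with_instance_counts)

# The example later in this section: Data Collector 3 runs the job plus two other pipelines,
# while Data Collector 1 runs none, so the job should be balanced.
print(load_is_even(with_instance_counts=[3], without_instance_counts=[0]))  # False
```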
In most cases, you'll balance a job after a pipeline failover occurs. However, you can balance a job enabled for pipeline failover anytime you notice that the pipeline load is not evenly distributed across available Data Collectors.
For example, let’s say that you run a job on a group of four Data Collectors assigned the WesternRegion label. You’ve enabled failover for the job and have set the Number of Instances property to two, reserving two of the Data Collectors for pipeline failover. When you start the job, a pipeline instance runs on Data Collector 1 and Data Collector 2 because they are currently running the fewest number of pipelines.
After a while, Data Collector 1 unexpectedly shuts down, causing the pipeline to fail over to Data Collector 3, which is already running two pipelines for two other jobs. When Data Collector 1 restarts, it does not immediately run any pipelines. However, Data Collector 3 is now running three pipelines. You balance the job to redistribute the pipeline load. Control Hub automatically determines that Data Collector 1 is available and running the fewest number of pipelines. Control Hub stops the pipeline on Data Collector 3 and restarts the pipeline on Data Collector 1, starting from the last-saved offset.
Balance Specific Jobs
From the Job Instances view, you can balance specific Data Collector jobs enabled for failover. When balancing the jobs, Control Hub redistributes the pipeline load evenly across Data Collectors that have the necessary labels and that have not exceeded any resource thresholds.
- In the navigation panel, click .
- Select the jobs that you want to balance.
- Click the Balance Jobs icon: .
- At the confirmation prompt, click OK.
Balance Jobs on Specific Data Collectors
From the Execute view, you can balance all the Data Collector jobs enabled for failover and running on specific Data Collectors. When balancing the jobs, Control Hub redistributes the pipeline load evenly across all available Data Collectors that have the necessary labels and that have not exceeded any resource thresholds.
- In the navigation panel, click .
- Select the Data Collectors running jobs that you want to balance.
- Click the Balance icon: .
- At the confirmation prompt, click Balance.
Comparing Balance Jobs and Synchronize Jobs
Balancing Data Collector jobs differs from synchronizing jobs. The following table lists the key differences:
Action | Description |
---|---|
Balance Jobs | Redistributes the pipeline load for active jobs enabled for failover across available Data Collectors that have not exceeded any resource thresholds, restarting pipeline instances from the last-saved offsets. |
Synchronize Jobs | |