Number of Pipeline Instances
For Data Collector jobs, you can manually scale out pipeline processing by increasing the number of pipeline instances that Control Hub runs for a job.
By default, when you start a job, Control Hub runs one pipeline instance on the available execution engine that is running the fewest pipelines. An available execution engine is any registered engine that is assigned all of the labels specified for the job and that has not exceeded any resource thresholds.
For example, if three Data Collectors are assigned all of the labels specified for the job, Control Hub runs one pipeline instance on the Data Collector that is running the fewest pipelines.
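The placement rule can be pictured with a short sketch. The following Python snippet is purely illustrative, not Control Hub's implementation; the engine names, labels, and pipeline counts are hypothetical:

```python
# Purely illustrative: a toy model of the default placement rule described
# above, not Control Hub's actual implementation. Engine names, labels, and
# pipeline counts are hypothetical.
from dataclasses import dataclass

@dataclass
class DataCollector:
    name: str
    labels: set
    running_pipelines: int
    over_threshold: bool = False   # e.g. a CPU or memory threshold exceeded

def pick_engine(engines, job_labels):
    """Return the available engine that is running the fewest pipelines."""
    available = [
        e for e in engines
        if job_labels.issubset(e.labels) and not e.over_threshold
    ]
    return min(available, key=lambda e: e.running_pipelines) if available else None

engines = [
    DataCollector("dc1", {"web"}, running_pipelines=4),
    DataCollector("dc2", {"web"}, running_pipelines=1),
    DataCollector("dc3", {"web"}, running_pipelines=2, over_threshold=True),
]
print(pick_engine(engines, {"web"}).name)   # dc2: fewest pipelines, under thresholds
```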
When you run multiple pipeline instances for a Data Collector job, each pipeline instance runs on a separate Data Collector. The pipeline instances do not communicate with each other. Each pipeline instance simply completes the same set of instructions. This can result in the same data being processed multiple times if the pipeline is not designed to be run with multiple instances or if the origin system is not designed for scaling out.
For example, let's say you have a pipeline that uses an SFTP/FTP Client origin to read files from a server using the Secure File Transfer Protocol (SFTP). When you create a job for the pipeline, you set the number of pipeline instances to two. When Control Hub starts the job, it runs two pipeline instances, each of which begins reading the same files from the same server, resulting in duplicate data being processed.
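To see why this duplicates data, consider a toy version of the scenario. The snippet below is only an illustration with made-up file names; it shows that two identical, uncoordinated instances each process every file:

```python
# Toy illustration with made-up file names: two identical, uncoordinated
# instances each read the full file listing, so every file is processed twice.
files_on_server = ["orders_001.csv", "orders_002.csv"]

def run_instance(instance_id):
    # Nothing divides the files between instances; each one reads everything.
    return [(instance_id, name) for name in files_on_server]

processed = run_instance(1) + run_instance(2)
print(processed)   # each file appears twice, once per instance
```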
Let's look at a few example Data Collector pipelines designed for scaling out:
- Reading log files from multiple servers
- You have three web servers that contain log files in the same directory. A Data Collector runs on each of the web servers. You design a pipeline that uses a Directory origin to read log files from the directory. You create a job for the pipeline that sets the number of pipeline instances to 3. When you start the job, Control Hub runs three pipeline instances, one on each of the web server Data Collectors. Each pipeline instance reads and processes a different set of data: the local log files on that server.
- Reading from Kafka
- You design a Data Collector pipeline that uses a Kafka Multitopic Consumer origin to read from one Kafka topic that contains two partitions. You create a job for the pipeline that sets the number of pipeline instances to 2. When you start the job, Control Hub runs two pipeline instances. Kafka automatically handles the partitioning, such that each pipeline instance reads from a separate partition.
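Outside of Data Collector, the same partition-splitting behavior can be reproduced with any two consumers that share a Kafka consumer group. The following sketch uses the kafka-python client; the broker address, topic name, partition count, and group ID are assumptions for the example:

```python
# Sketch using the kafka-python client, outside of Data Collector. Two
# consumers that share a group id split the partitions of the topic between
# them, which is the behavior the Kafka Multitopic Consumer origin relies on.
# The broker address, topic name, and group id are assumptions.
from kafka import KafkaConsumer

def make_consumer(client_id):
    # Each "pipeline instance" would create a consumer like this one.
    return KafkaConsumer(
        "orders",                          # topic assumed to have two partitions
        bootstrap_servers="localhost:9092",
        group_id="orders-job",             # shared group id -> Kafka divides partitions
        client_id=client_id,
    )

c1 = make_consumer("instance-1")
c2 = make_consumer("instance-2")

# Poll both consumers so each joins the group and the rebalance settles; with
# two partitions and two group members, each consumer holds one partition.
for _ in range(3):
    c1.poll(timeout_ms=2000)
    c2.poll(timeout_ms=2000)

print("instance-1:", c1.assignment())
print("instance-2:", c2.assignment())
```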
Scaling Out Active Jobs
Control Hub can automatically scale out pipeline processing for an active Data Collector job when you set the number of pipeline instances to -1.
When the number of pipeline instances is set to -1, Control Hub runs one pipeline instance on each available Data Collector. If an additional Data Collector becomes available while the job is active, Control Hub automatically starts another pipeline instance on that Data Collector.
For example, if three Data Collectors have all of the specified labels for the job when you start the job, Control Hub runs three pipeline instances, one on each Data Collector. If you register another Data Collector with the same labels as the currently running job, Control Hub automatically starts a fourth pipeline instance on that newly available Data Collector.
When the number of pipeline instances is set to any other value, you must synchronize the active job to start additional pipeline instances on newly available Data Collectors.
Control Hub can automatically scale out pipeline processing up to a maximum of 50 pipeline instances for an active job.
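The automatic scale-out rule can be summarized in a short sketch. Again, this is a toy model of the behavior described above rather than Control Hub's implementation, and the engine and job attributes are hypothetical:

```python
# Toy model of the automatic scale-out rule for jobs with instances set to -1;
# not Control Hub's implementation. Engine and job attributes are hypothetical.
from dataclasses import dataclass, field

MAX_AUTO_INSTANCES = 50   # documented cap on automatic scale-out

@dataclass
class Engine:
    name: str
    labels: set

@dataclass
class Job:
    labels: set
    requested_instances: int               # -1 means "scale automatically"
    active: bool = True
    running_on: list = field(default_factory=list)

def on_engine_registered(job, engine):
    """Start one more pipeline instance when a matching engine becomes available."""
    if not job.active or job.requested_instances != -1:
        return
    if not job.labels.issubset(engine.labels):
        return                              # engine lacks a label the job requires
    if len(job.running_on) >= MAX_AUTO_INSTANCES:
        return                              # hard limit of 50 instances
    job.running_on.append(engine.name)

job = Job(labels={"web"}, requested_instances=-1, running_on=["dc1", "dc2", "dc3"])
on_engine_registered(job, Engine("dc4", {"web", "eu-west"}))
print(job.running_on)                       # ['dc1', 'dc2', 'dc3', 'dc4']
```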
Rules and Guidelines
- If you set the number of pipeline instances to a value greater than the number of available Data Collectors, Control Hub fails to start the job.
- If you decrease the number of pipeline instances for a job that has already run, Control Hub cannot maintain the last-saved offset for each of the original pipeline instances. As a result, you might encounter unexpected behavior when you restart the modified job.