Number of Pipeline Instances

For Data Collector and Data Collector Edge jobs, you can manually scale out pipeline processing by increasing the number of pipeline instances that Control Hub runs for a job.

Note: For Transformer jobs, you cannot increase the number of pipeline instances. Control Hub always runs a single pipeline instance on one Transformer for each job. When Transformer runs the pipeline on Spark, Spark runs the application just as it runs any other application, automatically scaling out the processing across nodes in the cluster.

By default, when you start a job, Control Hub runs one pipeline instance on the available execution engine that is running the fewest pipelines. An available execution engine is any registered engine that is assigned all labels specified for the job and that has not exceeded any resource thresholds.

For example, if three Data Collectors are assigned all of the labels specified for the job, Control Hub runs one pipeline instance on the Data Collector running the fewest pipelines.
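
Conceptually, the default selection behaves like the following sketch. This is an illustration only, not Control Hub's actual code; the Engine class, its fields, and the label name are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Engine:
    """Hypothetical view of a registered execution engine."""
    name: str
    labels: set = field(default_factory=set)
    running_pipelines: int = 0
    over_threshold: bool = False  # True if a resource threshold is exceeded

def pick_engine(engines, job_labels):
    """Among engines assigned all of the job's labels and under their
    resource thresholds, choose the one running the fewest pipelines."""
    available = [e for e in engines
                 if job_labels <= e.labels and not e.over_threshold]
    return min(available, key=lambda e: e.running_pipelines) if available else None

# Three Data Collectors carry the job's label; the one running the
# fewest pipelines is selected.
engines = [
    Engine("sdc-1", {"western-region"}, running_pipelines=4),
    Engine("sdc-2", {"western-region"}, running_pipelines=1),
    Engine("sdc-3", {"western-region"}, running_pipelines=2),
]
print(pick_engine(engines, {"western-region"}).name)  # prints: sdc-2
```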

When you run multiple pipeline instances for a Data Collector or SDC Edge job, each pipeline instance runs on a separate Data Collector or SDC Edge. The pipeline instances do not communicate with each other; each instance simply completes the same set of instructions. This can result in the same data being processed multiple times if the pipeline is not designed to run with multiple instances or if the origin system does not support scaling out.

For example, let's say you have a pipeline that uses an SFTP/FTP Client origin to read files from a server using the Secure File Transfer Protocol (SFTP). When you create a job for the pipeline, you set the number of pipeline instances to two. When Control Hub starts the job, it runs two pipeline instances, each of which begins reading the same files from the same server - which results in duplicate data being processed.
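
The following sketch illustrates why this duplicates data. It is not Data Collector code; the server, directory, file names, and helper function are hypothetical stand-ins for the SFTP/FTP Client origin.

```python
def list_remote_files(server, directory):
    # Stand-in for an SFTP directory listing.
    return ["orders_001.csv", "orders_002.csv"]

def run_instance(instance_id):
    # Every instance follows the same instructions against the same
    # server, so every instance reads every file.
    for name in list_remote_files("sftp.example.com", "/incoming"):
        print(f"instance {instance_id} processing {name}")

run_instance(1)
run_instance(2)  # the same two files are read and processed again
```

By contrast, an origin that reads only local data, such as the Directory origin in the examples below, naturally gives each pipeline instance a disjoint set of files to process.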

Let's look at a few example Data Collector and SDC Edge pipelines designed for scaling out:

Reading log files from multiple servers
You have three web servers that contain log files in the same directory. A Data Collector runs on each of the web servers. You design a pipeline that uses a Directory origin to read log files from the directory. You create a job for the pipeline that sets the number of pipeline instances to 3. When you start the job, Control Hub runs three pipeline instances, one on each of the web server Data Collectors. Each pipeline reads and processes a different set of data - the local log files on that server.
Reading local files from multiple edge devices
You have five edge devices that contain local files in the same directory. An SDC Edge runs on each of the edge devices. You design a pipeline that uses a File Tail origin to read the files from the directory. You create a job for the pipeline that sets the number of pipeline instances to 5. When you start the job, Control Hub runs five pipeline instances, one on each of the SDC Edges. Each pipeline reads and processes a different set of data - the local files on that edge device.
Reading from Kafka
You design a Data Collector pipeline that uses a Kafka Multitopic Consumer origin to read from one Kafka topic that contains two partitions. You create a job for the pipeline that sets the number of pipeline instances to 2. When you start the job, Control Hub runs two pipeline instances. Kafka automatically handles the partition assignment, so that each pipeline instance reads from a separate partition, as illustrated in the sketch below.
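
The following standalone sketch shows the underlying Kafka consumer group behavior that makes this work, using the kafka-python client. It is not Data Collector code; the topic name, consumer group, and broker address are placeholders.

```python
# Minimal sketch, assuming a reachable Kafka broker and the kafka-python
# client (pip install kafka-python). Run this script twice: both consumers
# join the same group, and Kafka assigns each of the topic's two
# partitions to a different consumer, so each process reads different data.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "web-events",                      # hypothetical topic with two partitions
    group_id="scaled-out-job",         # every instance uses the same group
    bootstrap_servers="localhost:9092",
)

consumer.poll(timeout_ms=5000)         # joining the group triggers partition assignment
print("assigned partitions:", consumer.assignment())

for message in consumer:
    print(message.partition, message.offset, message.value)
```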