Multithreaded Processing

The Directory origin can use multiple threads to perform parallel processing based on the Number of Threads property.

Not valid in Data Collector Edge pipelines. In Data Collector Edge pipelines, the Directory origin uses only one thread to process data. The Number of Threads property is ignored.

When using multiple threads in a pipeline, each thread reads data from a single file, and each file can have a maximum of one thread read from it at a time. The file read order is based on the configuration for the Read Order property.

As the pipeline runs, each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner. A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin.

Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.

Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.

For example, say you configure the origin to read files from a directory using 5 threads and the Last Modified Timestamp read order. When you start the pipeline, the origin creates five threads, and Data Collector creates a matching number of pipeline runners.

The Directory origin assigns a thread to each of the five oldest files in the directory. Each thread processes its assigned file, passing batches of data to the origin. Upon receiving data, the origin passes a batch to each of the pipeline runners for processing.

After each thread completes processing a file, it continues to the next file based on the last-modified timestamp, until all files are processed.

For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.