UDP Multithreaded Source
The UDP Multithreaded Source origin reads messages from one or more UDP ports. The origin can create multiple worker threads to enable parallel processing in a multithreaded pipeline. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
When processing NetFlow messages, the stage generates different records based on the NetFlow version. When processing NetFlow 9, the records are generated based on the NetFlow 9 configuration properties. For more information, see NetFlow Data Processing.
The origin can also read binary or character-based raw data.
When you configure UDP Multithreaded Source, you specify the ports to use and the batch size and wait time. You specify the number of worker threads to use in multithreaded processing and you can specify the packet queue size. When epoll is available on the Data Collector machine, you can also specify the number of receiver threads to use to increase the throughput of packets to the pipeline.
You specify the data format for the data, then configure any related properties.
When a pipeline stops, the origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested data.
Processing Raw Data
Use the Raw/Separated Data data format to enable the UDP Multithreaded Source origin to generate records from binary or character-based raw data.
When processing raw data, the origin can generate a record for each UDP packet that it receives. Or, if you specify a separator character, then the origin can generate multiple records from each UDP packet.
When a separator produces multiple values from a packet, you specify the multiple values behavior: one record with only the first value, one record with all values as a list, or multiple records with one record for each value.
You can optionally specify an output field to use for the data. When not specified, the origin writes the raw data to the root field.
You might use the Raw/Separated Data data format to write raw data to a field that you later process using the Data Parser processor. This allows you to retain the raw data for another use.
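The three multiple values behaviors can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the origin's implementation: the function name, the mode constants, and the `output_field` handling are hypothetical, and how the origin treats empty values is not covered by this page.

```python
from typing import Any, List, Optional

# Hypothetical names that mirror the Multiple Values Behavior options.
FIRST_ONLY = "FIRST_ONLY"
ALL_AS_LIST = "ALL_AS_LIST"
SPLIT_INTO_MULTIPLE_RECORDS = "SPLIT_INTO_MULTIPLE_RECORDS"


def packet_to_records(payload: str,
                      separator: Optional[str],
                      behavior: str,
                      output_field: Optional[str] = None) -> List[Any]:
    """Turn one UDP payload into a list of records (illustrative only)."""

    def wrap(value: Any) -> Any:
        # Without an output field, the value is the root of the record.
        return value if output_field is None else {output_field: value}

    if not separator:
        # No separator: one record per UDP packet.
        return [wrap(payload)]

    # Assumption: empty values are kept, as str.split does by default.
    values = payload.split(separator)
    if behavior == FIRST_ONLY:
        return [wrap(values[0])]
    if behavior == ALL_AS_LIST:
        return [wrap(values)]
    if behavior == SPLIT_INTO_MULTIPLE_RECORDS:
        return [wrap(value) for value in values]
    raise ValueError(f"unknown behavior: {behavior}")
```

For a payload of three lines separated by a line feed, the first mode yields one record, the second yields one record that holds a three-item list, and the third yields three records.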
Receiver and Worker Threads
- Receiver threads - Used to pass data from the operating system socket to the origin's packet queue. By default, the origin uses a single receiver thread.
  You can configure the origin to use multiple receiver threads when Data Collector runs on a machine enabled for epoll. Epoll requires native libraries and is only available when Data Collector runs on recent versions of 64-bit Linux.
  When you enable multiple receiver threads, you increase the rate at which data is passed to the origin, at the cost of additional thread-management overhead.
  To use additional receiver threads, select the Use Native Transports (epoll) property, and then configure Number of Receiver Threads.
- Worker threads - Used to perform multithreaded pipeline processing. By default, the origin uses a single thread for pipeline processing. You can increase the number of threads to use to perform parallel processing of larger volumes of data. For more information, see Multithreaded Pipelines.
Packet Queue
The UDP Multithreaded Source origin uses a packet queue to hold incoming data in memory until the data can be incorporated in a batch and passed through the pipeline. When the packet queue is full, incoming packets are dropped. The number of packets that are dropped is noted in stage metrics.
When you configure the origin, you can specify the maximum number of packets to allow in the queue. The default is 200,000. Because the packet queue uses Data Collector heap memory, when increasing the size of the queue, you should consider increasing the Data Collector heap size as well. For more information, see Java Heap Size in the Data Collector documentation.
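The drop-when-full behavior resembles a bounded in-memory queue with a non-blocking insert. The following minimal sketch uses Python's standard `queue.Queue` to show the idea. The class name, method names, and counters are hypothetical and only loosely mirror the stage metrics; this is not how the origin is implemented.

```python
import queue
import threading
from typing import Optional


class PacketQueue:
    """Bounded packet queue that drops packets when full (illustrative only)."""

    def __init__(self, max_packets: int = 200_000) -> None:
        self._queue: "queue.Queue[bytes]" = queue.Queue(maxsize=max_packets)
        self._lock = threading.Lock()  # protects the two counters below
        self.dropped_packets = 0       # packets rejected because the queue was full
        self.queued_packets = 0        # packets accepted into the queue so far

    def offer(self, packet: bytes) -> bool:
        """Add a packet without blocking; count it as dropped if the queue is full."""
        try:
            self._queue.put_nowait(packet)
        except queue.Full:
            with self._lock:
                self.dropped_packets += 1
            return False
        with self._lock:
            self.queued_packets += 1
        return True

    def poll(self, timeout_s: float) -> Optional[bytes]:
        """Remove one packet, waiting up to timeout_s seconds; None if none arrives."""
        try:
            return self._queue.get(timeout=timeout_s)
        except queue.Empty:
            return None

    @property
    def queue_size(self) -> int:
        """Approximate number of packets currently waiting."""
        return self._queue.qsize()
```

As a rough sizing aid, and assuming packets of about 1,500 bytes each, a full queue of 200,000 packets holds around 300 MB of payload before any per-object overhead, which is why a larger queue usually calls for a larger heap.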
Multithreaded Pipelines
The UDP Multithreaded Source origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you enable multithreaded processing, the UDP Multithreaded Source origin uses multiple concurrent threads for pipeline processing based on the Number of Worker Threads property. When you start the pipeline, the origin creates the number of threads specified in the property.
As packets arrive from the specified UDP ports, they enter the packet queue. There is a single instance of the packet queue per pipeline. All receiver threads (which can be more than one, when using epoll) place packets onto the queue. At the same time, each worker thread removes packets from the queue, parses them according to the specified data format, and processes the rest of the pipeline using a pipeline runner.
A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order in which batches are written to destinations is not guaranteed.
For example, say you enable multithreaded processing and set the Number of Worker Threads property to 5. When you start the pipeline, the origin creates five threads, and Data Collector creates a matching number of pipeline runners. The origin adds incoming data to the packet queue, creates batches of data from the queue and then passes the batches to the pipeline runners for processing.
Each pipeline runner performs the processing associated with the rest of the pipeline. After a batch is written to pipeline destinations, the pipeline runner becomes available for another batch of data. Each batch is processed and written as quickly as possible, independent of the batches processed by other pipeline runners, so batches may be written in a different order than they were read.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
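The interaction between the shared packet queue, the worker threads, the maximum batch size, and the batch wait time can be approximated with the following Python sketch. It is a simplified model under stated assumptions, not Data Collector code: `start_workers`, `process_batch`, and `stop_event` are hypothetical names, `process_batch` stands in for a pipeline runner, and, unlike the origin, the sketch skips empty batches instead of sending them.

```python
import queue
import threading
import time
from typing import Any, Callable, List


def start_workers(packet_queue: "queue.Queue[Any]",
                  num_workers: int,
                  max_batch_size: int,
                  batch_wait_s: float,
                  process_batch: Callable[[List[Any]], None],
                  stop_event: threading.Event) -> List[threading.Thread]:
    """Start worker threads that drain one shared queue in batches."""

    def worker() -> None:
        # Keep running until a stop is requested and the queue is drained.
        while not (stop_event.is_set() and packet_queue.empty()):
            batch: List[Any] = []
            deadline = time.monotonic() + batch_wait_s
            # Fill the batch until it is full or the wait time elapses.
            while len(batch) < max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(packet_queue.get(timeout=remaining))
                except queue.Empty:
                    break
            if batch:
                # Stand-in for a pipeline runner handling the rest of the pipeline.
                process_batch(batch)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    return threads
```

Because each worker builds and processes its own batch, records stay in order within a batch, while the order in which whole batches complete depends on thread scheduling.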
Metrics for Performance Tuning
The UDP Multithreaded Source origin provides packet queue metrics that you can use to tune pipeline performance.
- Dropped Packets - The number of packets that were dropped because the packet queue was full.
- Queue Size - The current size of the packet queue.
- Queued Packets - The total number of packets that have passed through the packet queue for processing.
These metrics can help you determine how to improve pipeline performance. For example, if you have a high volume of dropped packets and the queue size seems to be maxed out as you monitor the pipeline, you might increase the number of worker threads for the pipeline to allow for greater throughput. Or, if you have relatively high bursts of data volume and find packets getting dropped during those bursts, consider increasing the packet queue size to better accommodate them.
If the queue size is not maxed out, but the number of queued packets does not seem as high as you expect, you might be dropping packets on the operating system side. When epoll is available - that is, when Data Collector runs on recent versions of 64-bit Linux - increasing the number of receiver threads can increase the volume of packets that are passed to the origin.
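The tuning guidance above can be condensed into a small decision helper. The function below is a hypothetical sketch: the parameter names, the `full_ratio` threshold for treating the queue as maxed out, and the idea of comparing queued packets against an expected count are all assumptions made for illustration, not values or rules defined by Data Collector.

```python
from typing import List


def suggest_tuning(dropped_packets: int,
                   queue_size: int,
                   max_queue_size: int,
                   queued_packets: int,
                   expected_packets: int,
                   epoll_available: bool,
                   full_ratio: float = 0.9) -> List[str]:
    """Map packet queue metrics to tuning suggestions (illustrative heuristic)."""
    suggestions: List[str] = []
    # Assumption: a queue at or above full_ratio of its capacity counts as maxed out.
    queue_maxed_out = queue_size >= full_ratio * max_queue_size

    if dropped_packets > 0 and queue_maxed_out:
        # Workers cannot drain the queue fast enough.
        suggestions.append("increase Number of Worker Threads")
    elif dropped_packets > 0:
        # Drops without a persistently full queue suggest short bursts.
        suggestions.append("increase Packet Queue Size")

    if not queue_maxed_out and queued_packets < expected_packets:
        # Packets may be lost before they reach the origin.
        if epoll_available:
            suggestions.append("increase Number of Receiver Threads")
        else:
            suggestions.append(
                "packets may be dropped by the operating system "
                "(multiple receiver threads require epoll)")
    return suggestions
```

Treat the output as a starting point for investigation while monitoring the pipeline, not as a rule.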
Configuring a UDP Multithreaded Source
Configure a UDP Multithreaded Source origin to use multiple worker threads to process messages from one or more UDP ports.
- In the Properties panel, on the General tab, configure the following properties:
  - Name - Stage name.
  - Description - Optional description.
  - On Record Error - Error record handling for the stage:
    - Discard - Discards the record.
    - Send to Error - Sends the record to the pipeline for error handling.
    - Stop Pipeline - Stops the pipeline.
- On the UDP tab, configure the following properties:
  - Port - Port to listen to for data. Using simple or bulk edit mode, click the Add icon to list additional ports. To listen to a port below 1024, Data Collector must be run by a user with root privileges. Otherwise, the operating system does not allow Data Collector to bind to the port.
    Note: No other pipelines or processes can already be bound to the listening port. The listening port can be used only by a single pipeline.
  - Data Format - Data format passed by UDP:
    - collectd
    - NetFlow
    - syslog
    - Raw/Separated Data
  - Use Native Transports (epoll) - Specifies whether to use multiple receiver threads for each port. Using multiple receiver threads can improve performance. Multiple receiver threads require epoll, which can be available when Data Collector runs on recent versions of 64-bit Linux.
  - Number of Receiver Threads - Number of receiver threads to use for each port. For example, if you configure two threads per port and configure the origin to use three ports, the origin uses a total of six threads. Use to increase the number of threads passing data to the origin when epoll is available on the Data Collector machine. Default is 1.
  - Max Batch Size (messages) - Maximum number of messages to include in a batch and pass through the pipeline at one time. Honors values up to the Data Collector maximum batch size. Default is 1000, which is also the default Data Collector maximum batch size.
  - Batch Wait Time (ms) - Milliseconds to wait before sending a partial or empty batch.
  - Packet Queue Size - The maximum number of packets to hold in the packet queue for processing. Default is 200,000.
  - Number of Worker Threads - The number of threads that the origin uses to perform pipeline processing.
- On the syslog tab, define the character set for the data.
- On the collectd tab, define the following collectd properties:
  - Convert Hi-Res Time & Interval - Converts the collectd high resolution time format interval and timestamp to UNIX time, in milliseconds.
  - Exclude Interval - Excludes the interval field from the output record.
  - Auth File - Path to an optional authentication file. Use an authentication file to accept signed and encrypted data.
  - TypesDB File Path - Path to a user-provided types.db file. Overrides the default types.db file.
  - Charset - Character set of the data.
- For raw data, on the Raw/Separated Data tab, define the following properties:
  - Raw Data Mode - Type of raw data to process: binary or string data.
  - Output Field Path - Optional output field for the raw data. When not used, the origin writes the raw data to the root field.
  - Multiple Values Behavior - Action to take when splitting the data on the data separator generates multiple values from a UDP packet:
    - First Value Only - Returns one record with the first value.
    - All Values as a List - Returns one record with all values in a List.
    - Split into Multiple Records - Returns multiple records, one record for each value.
  - Data Separator - Optional separator used to split the data in a UDP packet into multiple values. Specify byte literals using Java Unicode syntax, \u<character code>. For example, the default line feed character is expressed as \u000A.
  - Charset - Character set used by string data.
- For NetFlow 9 data, on the NetFlow 9 tab, configure the following properties. When processing earlier versions of NetFlow data, these properties are ignored.
  - Record Generation Mode - Determines the type of values to include in the record. Select one of the following options:
    - Raw Only
    - Interpreted Only
    - Both Raw and Interpreted
  - Max Templates in Cache - The maximum number of templates to store in the template cache. For more information about templates, see Caching NetFlow 9 Templates. Default is -1 for an unlimited cache size.
  - Template Cache Timeout (ms) - The maximum number of milliseconds to cache an idle template. Templates unused for more than the specified time are evicted from the cache. For more information about templates, see Caching NetFlow 9 Templates. Default is -1 for caching templates indefinitely.
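The two template cache properties can be pictured as a size limit plus an idle timeout, with -1 disabling each limit. The Python sketch below models that behavior under stated assumptions: the `TemplateCache` class and its methods are hypothetical, and evicting the least recently used template when the cache is full is an assumed policy that this page does not specify.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TemplateCache:
    """Template cache with a size limit and an idle timeout (illustrative only)."""

    def __init__(self,
                 max_templates: int = -1,
                 timeout_ms: int = -1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_templates = max_templates  # -1 means unlimited
        self._timeout_ms = timeout_ms        # -1 means cache indefinitely
        self._clock = clock                  # returns seconds; injectable for tests
        # Maps key -> (template, last_used_seconds), ordered oldest use first.
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def put(self, key: Hashable, template: Any) -> None:
        self._entries[key] = (template, self._clock())
        self._entries.move_to_end(key)
        if self._max_templates >= 0:
            # Assumed policy: drop the least recently used template when full.
            while len(self._entries) > self._max_templates:
                self._entries.popitem(last=False)

    def get(self, key: Hashable) -> Optional[Any]:
        self._evict_idle()
        entry = self._entries.get(key)
        if entry is None:
            return None
        template, _ = entry
        # Using a template resets its idle timer.
        self._entries[key] = (template, self._clock())
        self._entries.move_to_end(key)
        return template

    def _evict_idle(self) -> None:
        if self._timeout_ms < 0:
            return
        cutoff = self._clock() - self._timeout_ms / 1000.0
        # Entries are ordered by last use, so stop at the first fresh one.
        while self._entries:
            _, (_, last_used) = next(iter(self._entries.items()))
            if last_used >= cutoff:
                break
            self._entries.popitem(last=False)
```

With the defaults of -1 for both properties, this model never evicts a template, so memory use grows with the number of distinct templates received.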