CSV Parser

When you configure a stage to read delimited data, you choose the CSV parser to use. You can use one of the following parsers:

Apache Commons

The Apache Commons parser can process a range of delimited format types.

When the Apache Commons parser encounters certain errors, it skips the line and continues processing the remainder of the file, message, or object. Although it can be slower than the Univocity parser, the Apache Commons parser is the more robust of the two.
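The skip-and-continue behavior described above can be illustrated with a minimal sketch. This is not the Apache Commons parser itself: it uses Python's standard csv module as a stand-in, and the function name parse_skipping_bad_lines and the sample lines are hypothetical. It assumes each record occupies a single line.

```python
import csv

def parse_skipping_bad_lines(lines, delimiter=","):
    """Parse each line on its own; skip lines that fail to parse.

    Illustration only: mimics "skip the line and continue" using
    Python's csv module in strict mode, not Apache Commons CSV.
    """
    records, skipped = [], []
    for line_no, line in enumerate(lines, start=1):
        try:
            # strict=True makes the reader raise csv.Error on malformed input
            rows = list(csv.reader([line], delimiter=delimiter, strict=True))
        except csv.Error as exc:
            skipped.append((line_no, str(exc)))
            continue  # skip this line, keep processing the rest
        records.extend(rows)
    return records, skipped

# Hypothetical sample: line 3 has a stray character after a closing quote.
sample = ['id,name', '1,"Ada"', '2,"Bob"x', '3,"Cy"']
records, skipped = parse_skipping_bad_lines(sample)
print(records)
print([line_no for line_no, _ in skipped])
```

In this sketch the malformed line is recorded and dropped, and the lines after it are still parsed.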

When you use the Apache Commons parser, you specify the delimiter format type and related properties. You also specify the maximum record length to process. When a record exceeds the maximum record length defined for the stage, the stage processes the file, object, or message based on the stage configuration.

The Apache Commons parser can process the following delimited format types:

Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
RFC4180 CSV - Comma-separated file that strictly follows RFC 4180.
MS Excel CSV - Microsoft Excel comma-separated file.
MySQL CSV - MySQL comma-separated file.
Tab-Separated Values - File that includes tab-separated values.
PostgreSQL CSV - PostgreSQL comma-separated file.
PostgreSQL Text - PostgreSQL text file.
Custom - File that uses user-defined delimiter, escape, and quote characters.
Multi Character Delimited - File that uses multiple user-defined characters to delimit fields and lines, and single user-defined escape and quote characters.
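To make the Custom format type more concrete, the following sketch reads data that uses user-defined delimiter, quote, and escape characters. It uses Python's standard csv module purely as an illustration and is not how the stage is configured. The function name read_custom, the chosen characters, and the sample text are assumptions.

```python
import csv
import io

def read_custom(text, delimiter, quote, escape):
    """Read delimited text with user-defined delimiter, quote, and escape characters."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quote,
        escapechar=escape,
    )
    return [row for row in reader]

# Hypothetical data: pipe-delimited, single-quote quoting, backslash escape.
# The second field contains a quoted delimiter; the third an escaped one.
rows = read_custom("a|'b|c'|d\\|e\n", delimiter="|", quote="'", escape="\\")
print(rows)
```

A delimiter inside quotes or preceded by the escape character is kept as part of the field value rather than splitting the field.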

Apache Commons is the default CSV parser.

Univocity

The Univocity CSV parser can provide better performance than the Apache Commons parser, especially when processing wide files, such as those with more than 200 columns.

When you use the Univocity parser, you specify the field separator, escape character, quote character, and line character to use. You define the maximum number of columns and maximum number of characters for each column to process. You can also configure the stage to skip empty lines and to allow comments.

When the pipeline starts, the Univocity parser is allocated enough memory to process the configured maximum number of characters for each of the configured maximum number of columns.
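As a rough worked example of why these two limits matter for memory, the sketch below multiplies them to get an upper bound on buffered characters. The limit values are hypothetical, and the 2 bytes per character figure is an assumption based on Java's 16-bit char type. The parser's actual allocation may differ.

```python
def estimate_buffer_chars(max_columns, max_chars_per_column):
    """Upper bound on characters the parser must be able to hold for one record."""
    return max_columns * max_chars_per_column

# Hypothetical limits, not product defaults.
max_columns = 1000
max_chars_per_column = 1000

chars = estimate_buffer_chars(max_columns, max_chars_per_column)
# Assumption: 2 bytes per character (Java char); real overhead is not modeled.
approx_bytes = chars * 2

print(chars)
print(approx_bytes)
```

Because the estimate grows with the product of the two limits, raising either one far above what the data needs can reserve much more memory than necessary.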

When a record exceeds either maximum, the parser skips the remainder of the file, object, or message and proceeds to the next one. The stage processes the problematic file, object, or message based on the stage configuration.
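The limit handling described above can be sketched as follows. This is an illustration built on Python's standard csv module, not the Univocity parser. The function name parse_with_limits, the return shape, and the sample data are hypothetical.

```python
import csv
import io

def parse_with_limits(text, max_columns, max_chars_per_column):
    """Parse until a record exceeds either limit, then stop.

    Returns (records, completed). completed is False when the
    remainder of the input was skipped because of a limit violation.
    """
    records = []
    for row in csv.reader(io.StringIO(text)):
        too_wide = len(row) > max_columns
        too_long = any(len(value) > max_chars_per_column for value in row)
        if too_wide or too_long:
            # Stop here: the rest of this input is not processed.
            return records, False
        records.append(row)
    return records, True

# Hypothetical input: the second record has three columns, exceeding the limit of two.
data = "a,b\nc,d,e\nf,g\n"
records, completed = parse_with_limits(data, max_columns=2, max_chars_per_column=10)
print(records, completed)
```

Note the contrast with skipping a single line: here the valid third record is never read, because processing of this input ends at the first violation.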