When you configure a stage to read delimited data, you can choose one of the
following CSV parsers:
- Apache Commons
- The Apache Commons parser can process a range of delimited format types.
- When the Apache Commons parser encounters certain errors, it skips the line
and continues processing the remainder of the file, message, or object.
Though it can be slower than the Univocity parser, the Apache Commons parser
is the more robust parser to use.
- When you use the Apache Commons parser, you specify the delimiter format
type and related properties. You also specify the maximum record length to
process. When a record exceeds the maximum record length defined for the
stage, the stage processes the file, object, or message based on the stage
configuration.
- The Apache Commons parser can process the following delimited format
types:
  - Default CSV - File that includes comma-separated
    values. Ignores empty lines in the file.
  - RFC4180 CSV - Comma-separated file that strictly
    follows RFC4180 guidelines.
  - MS Excel CSV - Microsoft Excel comma-separated
    file.
  - MySQL CSV - MySQL comma-separated file.
  - Tab-Separated Values - File that includes
    tab-separated values.
  - PostgreSQL CSV - PostgreSQL comma-separated
    file.
  - PostgreSQL Text - PostgreSQL text file.
  - Custom - File that uses user-defined delimiter,
    escape, and quote characters.
  - Multi Character Delimited - File that uses
    multiple user-defined characters to delimit fields and lines, and
    single user-defined escape and quote characters.
- Apache Commons is the default CSV parser.
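
The skip-and-continue behavior described above can be illustrated with a small Python sketch. It is not the Apache Commons implementation: it uses Python's standard `csv` module, the function name `parse_skipping_bad_lines` and its parameters are hypothetical, and it simply records an over-length record as skipped, whereas the stage handles that case according to its own configuration. It also parses one physical line at a time, so quoted fields that span lines are not supported.

```python
import csv
import io


def parse_skipping_bad_lines(text, delimiter=",", max_record_length=1024):
    """Parse delimited text line by line, skipping lines that fail to parse.

    Returns (records, skipped), where skipped holds (line_number, reason).
    Hypothetical helper for illustration only.
    """
    records, skipped = [], []
    for line_no, line in enumerate(io.StringIO(text), start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore empty lines
        if len(line) > max_record_length:
            # Stand-in for "handled based on the stage configuration".
            skipped.append((line_no, "record exceeds maximum length"))
            continue
        try:
            # strict=True makes the csv module raise csv.Error on bad input.
            row = next(csv.reader([line], delimiter=delimiter, strict=True))
        except csv.Error as exc:
            skipped.append((line_no, str(exc)))
            continue  # skip the bad line and keep going
        records.append(row)
    return records, skipped


records, skipped = parse_skipping_bad_lines('a,b\n\n"x"y,1\nc,d\n')
print(records)
print(skipped)
```

In this sketch the malformed third line is reported in `skipped` while the lines before and after it are still parsed, which mirrors the trade-off the section describes: more tolerance for bad input at some cost in speed.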
- Univocity
- The Univocity CSV parser can provide better performance than the Apache
Commons parser, especially when processing wide files, such as those
with more than 200 columns.
- When you use the Univocity parser, you specify the field separator, escape
character, quote character, and line character to use. You define the
maximum number of columns and maximum number of characters for each column
to process. You can also configure the stage to skip empty lines and to
allow comments.
- When the pipeline starts, the Univocity parser is allocated the memory
required to process the configured maximum number of characters for
the configured maximum number of columns.
- When a record exceeds either maximum, the parser stops processing the
file, object, or message and proceeds to the next one.
The stage processes the problematic file, object, or message based on the
stage configuration.
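
To make the two Univocity limits more concrete, here is a minimal Python sketch under stated assumptions. It does not use the Univocity library. The helper names (`estimate_buffer_chars`, `exceeds_limits`, `read_until_limit`) are hypothetical, and the size estimate assumes the allocation scales as maximum columns multiplied by maximum characters per column, which is one reading of the description above rather than a documented formula.

```python
import csv


def estimate_buffer_chars(max_columns, max_chars_per_column):
    """Rough upper bound on buffered characters (assumed: columns x chars)."""
    return max_columns * max_chars_per_column


def exceeds_limits(row, max_columns, max_chars_per_column):
    """True if the row has too many columns or any value is too long."""
    return len(row) > max_columns or any(
        len(value) > max_chars_per_column for value in row
    )


def read_until_limit(lines, max_columns, max_chars_per_column, separator=","):
    """Parse rows until one exceeds a limit, then stop reading this input.

    Returns (rows, completed). completed is False when parsing stopped early,
    which is where a caller would move on to the next file, object, or message.
    """
    rows = []
    for row in csv.reader(lines, delimiter=separator):
        if exceeds_limits(row, max_columns, max_chars_per_column):
            return rows, False
        rows.append(row)
    return rows, True


print(estimate_buffer_chars(1000, 1000))
print(read_until_limit(["a,b", "c,d,e", "f,g"], max_columns=2, max_chars_per_column=10))
```

The sketch suggests why the maximums are worth setting deliberately: under the assumed formula, raising either limit increases the memory reserved up front, while setting them too low causes otherwise valid input to be abandoned at the first oversized record.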