Delimited Data Format
Data Collector can read and write delimited data.
Reading Delimited Data
Origins that read delimited data generate a record for each delimited line in a file, object, or message. Processors that process delimited data generate records as described in the processor overview.
The CSV parser that you choose determines the delimiter properties that you configure and how the stage handles parsing errors. You can specify whether the data includes a header line and whether to use it. You can also define the number of lines to skip before reading, the character set of the data, and the root field type to use for the generated records.
You can also configure the stage to replace a string constant with null values and to ignore control characters.
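As a rough analogy for these reading options, the sketch below uses Python's standard csv module to skip leading lines, treat the next line as a header, and replace a string constant with null values. The sample data, the skip count, and the `\N` null constant are illustrative assumptions, not Data Collector defaults.

```python
import csv
import io

# Hypothetical sample: two lines to skip before the header line, and a
# "\N" constant standing in for null (an assumed convention, not a default).
raw = """# generated 2024-01-01
# source: sensors
id,name,reading
1,alpha,3.14
2,\\N,2.72
"""

LINES_TO_SKIP = 2      # analogous to the "lines to skip" property
NULL_CONSTANT = "\\N"  # analogous to "replace string constant with null"

buf = io.StringIO(raw)
for _ in range(LINES_TO_SKIP):
    next(buf)  # discard lines before the header

reader = csv.DictReader(buf)  # first remaining line becomes the header
records = [
    {k: (None if v == NULL_CONSTANT else v) for k, v in row.items()}
    for row in reader
]
print(records[1]["name"])  # None, since "\N" was replaced
```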
File-based origins can read from compressed files and archives.
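To illustrate reading delimited data from a compressed file, the sketch below writes and then reads a small gzip-compressed CSV with Python's standard library; the file path and contents are arbitrary examples.

```python
import csv
import gzip
import os
import tempfile

# Create a small gzip-compressed CSV file to read back (arbitrary sample).
path = os.path.join(tempfile.mkdtemp(), "data.csv.gz")
with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
    f.write("id,city\n1,Oslo\n2,Lima\n")

# Read records directly from the compressed file, one per delimited line.
with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows)  # [{'id': '1', 'city': 'Oslo'}, {'id': '2', 'city': 'Lima'}]
```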
For a list of stages that process delimited data, see Data Formats by Stage.
CSV Parser
When you configure a stage to read delimited data, you can choose the CSV parser to use. You can use one of the following parsers to read delimited data:
- Apache Commons
- The Apache Commons parser can process a range of delimited format types.
- Univocity
- The Univocity CSV parser can provide better performance than the Apache Commons parser, especially when processing wide files, such as those with more than 200 columns.
Delimited Data Root Field Type
Records created from delimited data can use either the list or list-map data type for the root field.
When origins or processors create records for delimited data, they create a single root field of the specified type and write the delimited data within the root field.
Use the default list-map root field type to easily process delimited data.
- List-Map
- Provides easy use of field names or column positions in expressions. Recommended for all new pipelines.
- List
- Provides continued support for pipelines created before version 1.1.0. Not recommended for new pipelines.
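The difference between the two root field types can be sketched with Python's csv module as a loose analogy: a list root field behaves like a plain row of values addressed by position, while a list-map root field behaves like a mapping addressed by field name. This is only an illustration of the access patterns, not Data Collector's internal record model.

```python
import csv
import io

data = "id,name\n1,alpha\n2,beta\n"

# List root field (analogy): each record is an ordered list of values,
# so fields are addressed by column position.
list_records = list(csv.reader(io.StringIO(data)))[1:]  # skip header row
print(list_records[0][1])  # alpha (position-based access)

# List-map root field (analogy): each record maps header names to values,
# so fields can be addressed by name.
listmap_records = list(csv.DictReader(io.StringIO(data)))
print(listmap_records[0]["name"])  # alpha (name-based access)
```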
Writing Delimited Data
When processing delimited data, file- or object-based destinations write each record as a delimited row in a file or object. Message-based destinations write each record as a message. Processors write delimited data as specified in the processor overview.
Destinations write records as delimited data. When you use this data format, the root field of each record must be of type list or list-map.
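Writing one delimited row per record can be sketched with Python's csv module; here each record is a mapping (the list-map case), and the field names and values are invented for the example.

```python
import csv
import io

# Records with a list-map (mapping) root field, written one per output row.
records = [
    {"id": "1", "name": "alpha"},
    {"id": "2", "name": "beta"},
]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name"], lineterminator="\n")
writer.writeheader()       # optional header line
writer.writerows(records)  # one delimited row per record

print(out.getvalue())
# id,name
# 1,alpha
# 2,beta
```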
All destinations use the Apache Commons CSV parser to write delimited data. The Apache Commons parser can write data as the following delimited format types:
- Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
- RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
- MS Excel CSV - Microsoft Excel comma-separated file.
- MySQL CSV - MySQL comma-separated file.
- Tab-Separated Values - File that includes tab-separated values.
- PostgreSQL CSV - PostgreSQL comma-separated file.
- PostgreSQL Text - PostgreSQL text file.
- Custom - File that uses user-defined delimiter, escape, and quote characters.
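To illustrate the Custom format type's user-defined characters, the sketch below uses Python's csv module with a hypothetical pipe delimiter, single-quote quote character, and backslash escape character; these particular choices are examples, not defaults of any stage.

```python
import csv
import io

# Hypothetical custom format: pipe delimiter, single-quote quoting, and a
# backslash escape character (mirroring the Custom format type's options).
out = io.StringIO()
writer = csv.writer(
    out,
    delimiter="|",
    quotechar="'",
    escapechar="\\",
    lineterminator="\n",
)
writer.writerow(["1", "needs|quoting", "plain"])
print(out.getvalue())  # 1|'needs|quoting'|plain
```

The field containing the delimiter is quoted automatically, which is the behavior a reader configured with the same characters expects.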
For a list of stages that write delimited data, see Data Formats by Stage.