Delimited Data Format

Data Collector can read and write delimited data.

Reading Delimited Data

Origins that read delimited data generate a record for each delimited line in a file, object, or message. Processors that process delimited data generate records as described in the processor overview.

The CSV parser that you choose determines the delimiter properties that you configure and how the stage handles parsing errors. You can specify if the data includes a header line and whether to use it. You can define the number of lines to skip before reading, the character set of the data, and the root field type to use for the generated record.

You can also configure the stage to replace a string constant with null values and to ignore control characters.

File-based origins can read from compressed files and archives.

For a list of stages that process delimited data, see Data Formats by Stage.

CSV Parser

When you configure a stage to read delimited data, you can choose the CSV parser to use. You can use one of the following parsers to read delimited data:

Apache Commons
The Apache Commons parser can process a range of delimited format types.
When the Apache Commons parser encounters certain errors, it skips the line and continues processing the remainder of the file, message, or object. Though it can be slower than the Univocity parser, the Apache Commons parser is the more robust of the two.
When you use the Apache Commons parser, you specify the delimiter format type and related properties. You also specify the maximum record length to process. When a record exceeds the maximum record length defined for the stage, the stage processes the file, object, or message based on the stage configuration.
The Apache Commons parser can process the following delimited format types:
  • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
  • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
  • MS Excel CSV - Microsoft Excel comma-separated file.
  • MySQL CSV - MySQL comma-separated file.
  • Tab-Separated Values - File that includes tab-separated values.
  • PostgreSQL CSV - PostgreSQL comma-separated file.
  • PostgreSQL Text - PostgreSQL text file.
  • Custom - File that uses user-defined delimiter, escape, and quote characters.
  • Multi Character Delimited - File that uses multiple user-defined characters to delimit fields and lines, and single user-defined escape and quote characters.
Apache Commons is the default CSV parser.
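The format types above correspond to format definitions in the Apache Commons CSV library. As a rough illustration of the parsing work the stage delegates to that library, here is a minimal standalone sketch that reads Default CSV with a header line. It uses the org.apache.commons.csv API directly (assuming the Commons CSV 1.9+ builder API) and is not Data Collector's internal code:

import java.io.Reader;
import java.io.StringReader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvReadSketch {
    public static void main(String[] args) throws Exception {
        // Sample delimited data with a header line.
        Reader input = new StringReader(
            "TransactionID,Type,UserID\n0003420303,04,362\n0003420304,08,1008\n");

        // CSVFormat.DEFAULT matches the Default CSV format type: comma-separated
        // values, empty lines ignored. The other format types map to constants
        // such as CSVFormat.RFC4180, CSVFormat.EXCEL, CSVFormat.MYSQL,
        // CSVFormat.TDF, CSVFormat.POSTGRESQL_CSV, and CSVFormat.POSTGRESQL_TEXT.
        CSVFormat format = CSVFormat.DEFAULT.builder()
            .setHeader()                // read the first line as the header
            .setSkipHeaderRecord(true)  // do not emit the header as a record
            .build();

        try (CSVParser parser = format.parse(input)) {
            for (CSVRecord record : parser) {
                // Fields are addressable by header name or by position.
                System.out.println(record.get("TransactionID") + " " + record.get(1));
            }
        }
    }
}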
Univocity
The Univocity CSV parser can provide better performance than the Apache Commons parser, especially when processing wide files, such as those with more than 200 columns.
When you use the Univocity parser, you specify the field separator, escape character, quote character, and line character to use. You define the maximum number of columns and maximum number of characters for each column to process. You can also configure the stage to skip empty lines and to allow comments.
When the pipeline begins, the Univocity parser is allocated the amount of memory required to process the configured maximum number of characters for the maximum number of columns.
When a record exceeds either maximum, the parser skips the remainder of the file, object, or message and moves on to the next one. The stage processes the problematic file, object, or message based on the stage configuration.
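The Univocity properties described above map directly onto the settings objects in the univocity-parsers library. The following is a minimal standalone sketch of that configuration, again using the library directly rather than Data Collector's internals (the specific limit values are arbitrary examples):

import java.io.StringReader;
import java.util.List;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityReadSketch {
    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();

        // Field separator, quote, escape, and line characters.
        settings.getFormat().setDelimiter(',');
        settings.getFormat().setQuote('"');
        settings.getFormat().setQuoteEscape('\\');
        settings.getFormat().setLineSeparator("\n");

        // Upper bounds that drive the up-front memory allocation:
        // the parser reserves room for maxColumns x maxCharsPerColumn.
        settings.setMaxColumns(512);
        settings.setMaxCharsPerColumn(1024);

        // Optional behavior: skip empty lines and allow comment lines.
        settings.setSkipEmptyLines(true);
        settings.getFormat().setComment('#');

        CsvParser parser = new CsvParser(settings);
        List<String[]> rows = parser.parseAll(
            new StringReader("TransactionID,Type,UserID\n0003420303,04,362\n"));
        for (String[] row : rows) {
            System.out.println(String.join(" | ", row));
        }
    }
}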

Delimited Data Root Field Type

Records created from delimited data can use either the list or list-map data type for the root field.

When origins or processors create records for delimited data, they create a single root field of the specified type and write the delimited data within the root field.

Use the default list-map root field type to easily process delimited data.

List-Map
Provides easy use of field names or column positions in expressions. Recommended for all new pipelines.
A list-map root field type results in a structure that preserves the order of data, as follows:
/<first header>:<value>
/<second header>:<value>
/<third header>:<value>
...

For example, with the list-map root field type, the following delimited rows:

TransactionID,Type,UserID
0003420303,04,362
0003420304,08,1008
are converted to records as follows:
/TransactionID: 0003420303
/Type: 04
/UserID: 362

/TransactionID: 0003420304
/Type: 08
/UserID: 1008
If the data does not include a header, or if you choose to ignore the header, list-map records use the column position as the header, as follows:
0: <value>
1: <value>
2: <value>
For example, when you ignore the header for the same data, you get the following records:
0: 0003420303
1: 04
2: 362

0: 0003420304
1: 08
2: 1008
In an expression, you can use the field name or the column position with a standard record function to reference a field. For example, you can use either of the following record:value() expressions to return data in the TransactionID field:
${record:value('/TransactionID')}
${record:value('[0]')}
Note: When writing scripts for scripting processors, such as the Jython Evaluator or JavaScript Evaluator, you should treat list-map records as maps.
For more information about standard record functions, see Record Functions.
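Setting Data Collector's internal record types aside, the list-map behavior is easy to picture with an ordered map. The following illustrative sketch uses plain Java collections, not the Data Collector SDK:

import java.util.ArrayList;
import java.util.LinkedHashMap;

public class ListMapSketch {
    public static void main(String[] args) {
        // A list-map behaves like a map that also preserves column order.
        LinkedHashMap<String, String> record = new LinkedHashMap<>();
        record.put("TransactionID", "0003420303");
        record.put("Type", "04");
        record.put("UserID", "362");

        // Lookup by field name, as in ${record:value('/TransactionID')}.
        System.out.println(record.get("TransactionID"));

        // Lookup by column position, as in ${record:value('[0]')}: the
        // preserved insertion order is what makes positional access work.
        System.out.println(new ArrayList<>(record.values()).get(0));
    }
}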
List
Provides continued support for pipelines created before version 1.1.0. Not recommended for new pipelines.
A list root field type results in a list with an index for the header position and a map with each header and its associated value, as follows:
0
   /header = <first header>
   /value = <value for first header>
1
   /header = <second header>
   /value = <value for second header>
2  
   /header = <third header>
   /value = <value for third header>
...

For example, the same delimited rows described above are converted to records as follows:

0
   /header = TransactionID
   /value = 0003420303
1
   /header = Type
   /value = 04
2
   /header = UserID
   /value = 362

0
   /header = TransactionID
   /value = 0003420304
1
   /header = Type
   /value = 08
2
   /header = UserID
   /value = 1008
If the data does not include a header or if you choose to ignore a header, the list records omit the header from the map as follows:
0
   /value = <value>
1
   /value = <value>
2
   /value = <value>
...
For example, when you ignore the header for the same sample data, you get the following records:
0
   /value = 0003420303
1
   /value = 04
2
   /value = 362

0
   /value = 0003420304
1
   /value = 08
2
   /value = 1008
For data in list records, either use the delimited data functions or include the full field path in standard record functions. For example, you can use the record:dValue() delimited data function to return the value associated with the specified header.
Tip: You can use the record:dToMap() function to convert a list record to a map, and then use standard functions for record processing.

For more information about record:dToMap() and a full list of delimited data record functions and their syntax, see Delimited Data Record Functions.
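To picture how a list record differs, and what a record:dToMap()-style conversion accomplishes, the following sketch again uses plain Java collections. It is illustrative only; the fold shown is an assumption about the general idea of the conversion, not the function's exact implementation:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ListRecordSketch {
    public static void main(String[] args) {
        // A list record: one entry per column, each a small map
        // holding the header and the value.
        List<Map<String, String>> record = new ArrayList<>();
        record.add(Map.of("header", "TransactionID", "value", "0003420303"));
        record.add(Map.of("header", "Type", "value", "04"));
        record.add(Map.of("header", "UserID", "value", "362"));

        // Access by full field path: index 0's value entry.
        System.out.println(record.get(0).get("value"));

        // A dToMap-style conversion folds the header/value pairs into a
        // single map, after which standard map-style access works.
        Map<String, String> asMap = new LinkedHashMap<>();
        for (Map<String, String> column : record) {
            asMap.put(column.get("header"), column.get("value"));
        }
        System.out.println(asMap.get("TransactionID"));
    }
}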

For a full list of origins that support this data format, see Origins in the "Data Formats by Stage" appendix.

Writing Delimited Data

When processing delimited data, file- or object-based destinations write each record as a delimited row in a file or object. Message-based destinations write each record as a message. Processors write delimited data as specified in the processor overview.

When you use this data format, records must use the list or list-map root field type.

All destinations use the Apache Commons CSV parser to process delimited data. The Apache Commons parser can write data as the following delimited format types:

  • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
  • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
  • MS Excel CSV - Microsoft Excel comma-separated file.
  • MySQL CSV - MySQL comma-separated file.
  • Tab-Separated Values - File that includes tab-separated values.
  • PostgreSQL CSV - PostgreSQL comma-separated file.
  • PostgreSQL Text - PostgreSQL text file.
  • Custom - File that uses user-defined delimiter, escape, and quote characters.
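As with reading, these format types correspond to CSVFormat definitions in the Apache Commons CSV library. The following minimal standalone sketch writes two rows as Default CSV, using the library directly (assuming the Commons CSV 1.9+ builder API); Data Collector's own write path is internal:

import java.io.StringWriter;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;

public class CommonsCsvWriteSketch {
    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();

        // CSVFormat.DEFAULT corresponds to the Default CSV format type;
        // substitute RFC4180, EXCEL, MYSQL, TDF, POSTGRESQL_CSV, or
        // POSTGRESQL_TEXT for the other types.
        CSVFormat format = CSVFormat.DEFAULT.builder()
            .setHeader("TransactionID", "Type", "UserID")
            .build();

        try (CSVPrinter printer = new CSVPrinter(out, format)) {
            // Each record becomes one delimited row; the header row is
            // written automatically when the printer is created.
            printer.printRecord("0003420303", "04", "362");
            printer.printRecord("0003420304", "08", "1008");
        }
        System.out.print(out);
    }
}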

For a list of stages that write delimited data, see Data Formats by Stage.