Processing XML Data with Custom Delimiters

You can use custom delimiters with the text data format to process XML data. You might use the text data format to process XML data with no root element, which cannot be processed with the XML data format.

When using the text data format in the origin to read XML data, you can use the XML Parser processor downstream to parse the XML data.

For example, the following XML document is valid and is best processed using the XML data format:

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
 </msg>
 <msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
 </msg>
</root>

However, the following XML document does not include an XML prolog or root element, so it is invalid:

<msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
</msg>
<msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
</msg>

You can use the text data format with a custom delimiter to process the invalid XML document. To do so, use </msg> as the custom delimiter to separate data into records, and make sure to include the delimiter in the record as follows:

When origins process text data, they write record data into a single text field named "text". When Directory processes the invalid XML document, it creates two records:

text
<msg> <time>8/12/2016 6:01:00</time> <request>GET /index.html 200</request> </msg>
<msg> <time>8/12/2016 6:03:43</time> <request>GET /images/sponsored.gif 304</request> </msg>

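The record splitting described above can be approximated outside of Data Collector. The following Python sketch is only an illustration of the idea, not the origin's actual implementation: the function name split_records and the include_delimiter parameter are hypothetical, chosen to mirror the "custom delimiter" and "include the delimiter in the record" settings mentioned above.

```python
def split_records(data, delimiter="</msg>", include_delimiter=True):
    """Split raw text into records at each occurrence of the delimiter.

    Each record is a dict with a single "text" field, mirroring how
    origins store text data. Hypothetical helper for illustration only.
    """
    records = []
    start = 0
    while True:
        end = data.find(delimiter, start)
        if end == -1:
            break  # no further delimiter in the remaining data
        stop = end + len(delimiter)
        # Keep or drop the delimiter depending on the setting.
        text = data[start:stop] if include_delimiter else data[start:end]
        records.append({"text": text.strip()})
        start = stop
    # Any non-empty remainder after the last delimiter becomes a final record.
    tail = data[start:].strip()
    if tail:
        records.append({"text": tail})
    return records


raw = """<msg>
    <time>8/12/2016 6:01:00</time>
    <request>GET /index.html 200</request>
</msg>
<msg>
    <time>8/12/2016 6:03:43</time>
    <request>GET /images/sponsored.gif 304</request>
</msg>
"""

records = split_records(raw)
for record in records:
    print(record["text"])
```

Keeping the delimiter matters here: without the closing </msg> tag, each record would hold an unterminated element that a downstream XML parser could not read.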
You can configure the XML Parser to parse the XML data as follows:

The XML Parser converts the time and request elements to list fields within the text map field, as shown. The table displays data types in angle brackets ( < > ):
text <map> (record 1)
- time <list>:
  • 0 <map>:
    - value <string>: 8/12/2016 6:01:00
- request <list>:
  • 0 <map>:
    - value <string>: GET /index.html 200

text <map> (record 2)
- time <list>:
  • 0 <map>:
    - value <string>: 8/12/2016 6:03:43
- request <list>:
  • 0 <map>:
    - value <string>: GET /images/sponsored.gif 304
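
To make the resulting field structure concrete, the following Python sketch parses one record's text field with the standard library's xml.etree.ElementTree and builds a similar map of lists. It only approximates the structure shown above and is not how the XML Parser processor is implemented; the function name parse_text_field is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_text_field(record):
    """Parse the XML held in record["text"] into a nested map.

    Each child element becomes a list of maps with a "value" entry,
    loosely following the structure shown above. Illustration only.
    """
    root = ET.fromstring(record["text"])
    parsed = {}
    for child in root:
        # Repeated child elements with the same tag accumulate in one list.
        parsed.setdefault(child.tag, []).append(
            {"value": (child.text or "").strip()}
        )
    return {"text": parsed}


record = {
    "text": "<msg> <time>8/12/2016 6:01:00</time> "
            "<request>GET /index.html 200</request> </msg>"
}
result = parse_text_field(record)
print(result["text"]["time"][0]["value"])
```

Because each record now contains exactly one msg element, it is well-formed XML on its own, which is why parsing succeeds even though the original file has no root element.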