XML Parser

The XML Parser processor parses an XML document in a string field and passes the parsed data to a map field in the output record.

When you configure the XML Parser processor, you specify the field that contains the XML document. You specify whether the processor replaces the data in the original field with the parsed data or passes the parsed data to another field. You also configure whether the processor stops the pipeline with an error or continues processing when the specified field to parse does not exist in a record.

You specify the schema that the processor uses to parse the XML document. The processor can infer the schema from the first record that it reads, or use a custom schema specified in JSON or DDL format.

Schema

The XML Parser processor requires a schema to parse the XML document. The processor can use one of the following modes to determine the schema:

Infer from Data: With the default mode, the XML Parser processor infers the schema from the first incoming record. The XML document in the specified field in the first record must include all fields in the schema.
Note: To infer the schema, the processor requires that Apache Spark version 2.4.0 or later is installed on the Transformer machine and on each node in the cluster.
Tip: As you build the pipeline, verify that the processor infers the schema as expected. Previewing the pipeline is the easiest way to determine how the processor infers the schema.
Use Custom Schema - JSON Format: The XML Parser processor uses the custom schema that you specify in JSON format.
Use Custom Schema - DDL Format: The XML Parser processor uses the custom schema that you specify in DDL format.

Use a custom schema when the XML document in the first record does not include all the fields in the schema or when the processor infers the schema inaccurately.
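To see why the first record matters in Infer from Data mode, consider the following minimal Python sketch. It is not the processor's inference logic: `infer_field_names` is a hypothetical helper that only collects top-level element names, and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def infer_field_names(xml_text):
    """Return the names of the root's child elements, in first-seen order."""
    names = []
    for child in ET.fromstring(xml_text):
        if child.tag not in names:
            names.append(child.tag)
    return names

# Hypothetical documents: the first record lacks the Items field.
first = "<root><TransactionID>G-1</TransactionID><Verified>true</Verified></root>"
later = ("<root><TransactionID>G-2</TransactionID><Verified>false</Verified>"
         "<Items>T-35089</Items></root>")

inferred = infer_field_names(first)
missing = [name for name in infer_field_names(later) if name not in inferred]

print(inferred)  # ['TransactionID', 'Verified']
print(missing)   # ['Items']
```

A schema built only from the first document would not know about Items, which is the situation where a custom schema is the safer choice.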

JSON Schema Format

To use JSON to define a custom schema, specify the field names and data types within a root field that uses the Struct data type.

Tip: Data types must be in lowercase letters. Also, the nullable attribute is required for most fields.

Here's an example of the basic structure:

{
  "type": "struct",
  "fields": [
    {
      "name": "<first field name>",
      "type": "<data type>",
      "nullable": <true|false>
    },
    {
      "name": "<second field name>",
      "type": "<data type>",
      "nullable": <true|false>
    }
  ]
}

To define a List field, use the Array data type and specify the data types of the subfields as follows:

{
  "name": "<list field name>",
  "type": {
    "type": "array",
    "elementType": "<subfield data type>",
    "containsNull": <true|false>
  },
  "nullable": <true|false>
}

To define a Map field, use the Struct type, then define the subfields as follows:

{
  "name": "<map field name>",
  "type": {
    "type": "struct",
    "fields": [
      {
        "name": "<first subfield name>",
        "type": "<data type>",
        "nullable": <true|false>
      },
      {
        "name": "<second subfield name>",
        "type": "<data type>",
        "nullable": <true|false>
      }
    ]
  },
  "nullable": <true|false>
}
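Before pasting a JSON schema into the processor, it can help to check it against the rules above. The following Python sketch is one way to do that. `check_schema` is a hypothetical lint helper, not part of the product, and it assumes that every field should carry the nullable attribute, which is stricter than the "most fields" wording in the tip.

```python
import json

def check_schema(node, path="root"):
    """Collect problems in a struct-style schema (hypothetical lint only)."""
    problems = []
    if isinstance(node, str):
        # Primitive type name: must be written in lowercase.
        if node != node.lower():
            problems.append(f"{path}: data type '{node}' is not lowercase")
        return problems
    kind = node.get("type")
    if kind == "struct":
        for field in node.get("fields", []):
            name = field.get("name", "<unnamed>")
            if "nullable" not in field:
                problems.append(f"{path}.{name}: missing nullable")
            problems += check_schema(field.get("type", ""), f"{path}.{name}")
    elif kind == "array":
        problems += check_schema(node.get("elementType", ""), f"{path}[]")
    else:
        problems.append(f"{path}: unexpected complex type {kind!r}")
    return problems

# Sample schema with two deliberate mistakes.
schema = json.loads("""
{
  "type": "struct",
  "fields": [
    {"name": "TransactionID", "type": "String", "nullable": false},
    {"name": "Items",
     "type": {"type": "array", "elementType": "string", "containsNull": true}}
  ]
}
""")

problems = check_schema(schema)
for problem in problems:
    print(problem)
# root.TransactionID: data type 'String' is not lowercase
# root.Items: missing nullable
```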

Example

The following JSON custom schema includes, in order, a String, Boolean, Map, and List field:

{
  "type": "struct",
  "fields": [
    {
      "name": "TransactionID",
      "type": "string",
      "nullable": false
    },
    {
      "name": "Verified",
      "type": "boolean",
      "nullable": false
    },
    {
      "name": "User",
      "type": {
        "type": "struct",
        "fields": [
          {
            "name": "ID",
            "type": "long",
            "nullable": true
          },
          {
            "name": "Name",
            "type": "string",
            "nullable": true
          }
        ]
      },
      "nullable": true
    },
    {
      "name": "Items",
      "type": {
        "type": "array",
        "elementType": "string",
        "containsNull": true
      },
      "nullable": true
    }
  ]
}

Suppose the order field contains the following XML document:

<root><Verified>true</Verified><Items>T-35089</Items><Items>M-00352</Items><Items>Q-11044</Items><TransactionID>G-23525-3350</TransactionID><User><ID>23005</ID><Name>Marnee Gehosephat</Name></User></root>

The processor generates the following record when configured to replace the order field with the parsed data:

Notice the User Map field with its Long and String subfields and the Items List field with its String elements. The order of the fields now matches the order in the custom schema. Any remaining fields in the record, such as the storeID field in this example, are passed to the output record unchanged.
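The shape of the parsed data can be approximated with a short Python sketch. This is only an illustration of the structure described above, not the processor's implementation or its output format. `convert` and `CASTS` are hypothetical names, and the schema is written as a Python dict that mirrors the JSON custom schema.

```python
import xml.etree.ElementTree as ET

# The example custom schema, mirrored as a Python dict.
SCHEMA = {"type": "struct", "fields": [
    {"name": "TransactionID", "type": "string", "nullable": False},
    {"name": "Verified", "type": "boolean", "nullable": False},
    {"name": "User", "type": {"type": "struct", "fields": [
        {"name": "ID", "type": "long", "nullable": True},
        {"name": "Name", "type": "string", "nullable": True}]},
     "nullable": True},
    {"name": "Items", "type": {"type": "array", "elementType": "string",
                               "containsNull": True}, "nullable": True},
]}

# Hypothetical converters for the primitive types used in this schema.
CASTS = {"string": str, "long": int,
         "boolean": lambda text: text.strip().lower() == "true"}

def convert(elements, dtype):
    """Convert the XML elements that share one tag into a schema-typed value."""
    if isinstance(dtype, dict) and dtype["type"] == "array":
        return [convert([element], dtype["elementType"]) for element in elements]
    if not elements:
        return None  # field absent from the document
    if isinstance(dtype, dict):  # struct: a map keyed in schema order
        return {field["name"]: convert(elements[0].findall(field["name"]),
                                       field["type"])
                for field in dtype["fields"]}
    return CASTS[dtype](elements[0].text or "")

ORDER_XML = ("<root><Verified>true</Verified><Items>T-35089</Items>"
             "<Items>M-00352</Items><Items>Q-11044</Items>"
             "<TransactionID>G-23525-3350</TransactionID>"
             "<User><ID>23005</ID><Name>Marnee Gehosephat</Name></User></root>")

parsed = convert([ET.fromstring(ORDER_XML)], SCHEMA)
print(list(parsed))     # ['TransactionID', 'Verified', 'User', 'Items']
print(parsed["User"])   # {'ID': 23005, 'Name': 'Marnee Gehosephat'}
print(parsed["Items"])  # ['T-35089', 'M-00352', 'Q-11044']
```

The sketch walks the schema rather than the document, which is why the result follows the schema's field order even though the XML lists Verified first.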

DDL Schema Format

To use DDL to define a custom schema, specify a comma-separated list of field names and data types. Here's an example of the basic structure:

<first field name> <data type>, <second field name> <data type>, <third field name> <data type>

To define a List field, use the Array data type and specify the data types of the subfields as follows:

<list field name> Array <subfields data type>

To define a Map field, use the Struct data type, then specify the names and types of the subfields as follows:

<map field name> Struct < <first subfield name>:<data type>, <second subfield name>:<data type> >

Tip: You can use backticks ( ` ) to escape field names that can be mistaken for reserved words, such as `count`.
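When field definitions already exist in code, a DDL string can be assembled rather than typed by hand. The following Python sketch shows one possible approach. `to_ddl`, `ddl_type`, `quote`, and the `escape` parameter are hypothetical names, the `count` field is added only to show escaping, and the sketch assumes that whitespace around the angle brackets is not significant.

```python
def quote(name, escape):
    """Wrap the name in backticks when the caller lists it for escaping."""
    return f"`{name}`" if name in escape else name

def ddl_type(dtype, escape):
    """Render a type: str is primitive, a one-item list is an Array, a dict is a Struct."""
    if isinstance(dtype, str):
        return dtype
    if isinstance(dtype, list):
        return f"Array<{ddl_type(dtype[0], escape)}>"
    subfields = ", ".join(f"{quote(name, escape)}:{ddl_type(sub, escape)}"
                          for name, sub in dtype.items())
    return f"Struct<{subfields}>"

def to_ddl(fields, escape=frozenset()):
    """Join top-level fields into a comma-separated DDL schema string."""
    return ", ".join(f"{quote(name, escape)} {ddl_type(dtype, escape)}"
                     for name, dtype in fields.items())

fields = {
    "TransactionID": "String",
    "count": "Integer",                           # hypothetical field
    "User": {"ID": "Integer", "Name": "String"},  # Map field
    "Items": ["String"],                          # List field
}

ddl = to_ddl(fields, escape={"count"})
print(ddl)
# TransactionID String, `count` Integer, User Struct<ID:Integer, Name:String>, Items Array<String>
```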

Example

The following DDL custom schema includes, in order, a String, Boolean, Map, and List field:

TransactionID String, Verified Boolean, User Struct <ID:Integer, Name:String>, Items Array <String>

Suppose the order field contains the following XML document:

<root><Verified>true</Verified><Items>T-35089</Items><Items>M-00352</Items><Items>Q-11044</Items><TransactionID>G-23525-3350</TransactionID><User><ID>23005</ID><Name>Marnee Gehosephat</Name></User></root>

The processor generates the following record when configured to replace the order field with the parsed data:

Schema Error Handling

The XML Parser processor handles schema errors as follows:

When the schema is not valid, the processor stops the pipeline with an error.
When the specified field to parse does not include a valid XML document, the processor handles the error based on the Error Handling configuration on the Schema tab:
- Permissive - The processor writes the invalid XML in a _corrupt_record field.
- Drop Malformed - The processor drops the record from the pipeline and continues processing.
- Fail Fast - The processor stops the pipeline with an error.

When the specified field to parse contains data that does not match the schema, the processor handles the error based on how the XML document differs from the schema:


XML Document	Error handling
Includes a field not defined in the schema	The processor ignores the field, dropping it from the output record.
Omits a field defined in the schema	The processor passes the field to the output record with a null value.
Includes data in a field that is not compatible with the data type defined in the schema	The processor handles the error based on the Error Handling configuration on the Schema tab. Permissive: passes the field with a null value to the output record and continues processing. Drop Malformed: drops the record from the pipeline and continues processing. Fail Fast: stops the pipeline with an error.
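The rules above can be summarized in a small Python simulation. This is a sketch of the documented behavior under simplifying assumptions, not the processor's code. The flat `SCHEMA`, the `parse` function, and the mode strings `permissive`, `drop_malformed`, and `fail_fast` are hypothetical, and a dropped record is represented by `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat schema: field name -> Python type used for conversion.
SCHEMA = {"ID": int, "Name": str}

def parse(xml_text, mode="permissive"):
    """Return the parsed fields, or None when the record is dropped."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        if mode == "permissive":
            return {"_corrupt_record": xml_text}  # keep the invalid XML
        if mode == "drop_malformed":
            return None
        raise ValueError(f"invalid XML: {err}")   # fail fast
    parsed = {}
    for name, cast in SCHEMA.items():  # fields outside the schema are ignored
        node = root.find(name)
        if node is None:
            parsed[name] = None        # omitted field becomes null
            continue
        try:
            parsed[name] = cast(node.text or "")
        except ValueError:
            if mode == "permissive":
                parsed[name] = None    # incompatible value becomes null
            elif mode == "drop_malformed":
                return None
            else:
                raise ValueError(f"field {name!r} does not match the schema")
    return parsed

extra = parse("<root><ID>7</ID><Name>Ann</Name><Extra>x</Extra></root>")
omitted = parse("<root><ID>7</ID></root>")
mismatch = parse("<root><ID>seven</ID><Name>Ann</Name></root>")
dropped = parse("<root><ID>seven</ID></root>", mode="drop_malformed")
corrupt = parse("<root><ID>", mode="permissive")

print(extra)     # {'ID': 7, 'Name': 'Ann'}
print(omitted)   # {'ID': 7, 'Name': None}
print(mismatch)  # {'ID': None, 'Name': 'Ann'}
print(dropped)   # None
print(corrupt)   # {'_corrupt_record': '<root><ID>'}
```

With `mode="fail_fast"`, the same invalid or mismatched inputs raise `ValueError` in this sketch, standing in for the pipeline stopping with an error.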

Configuring an XML Parser Processor

Configure an XML Parser processor to parse an XML document in a string field.

In the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

On the XML Parser tab, configure the following properties:


XML Parser Property	Description
XML Field	Field that contains the XML document to parse.
Replace Field	Replaces the XML document in the field with the parsed data.
Output Field	Field where the processor writes the parsed data. You can specify an existing field or a new field. If the field does not exist, the XML Parser processor creates the field. Available when Replace Field is cleared.
Fail if Missing	Stops the pipeline with an error if the specified XML field does not exist in a record. When cleared, the processor passes the record to the next stage and then continues processing.
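The interaction between Replace Field, Output Field, and Fail if Missing can be sketched in Python as follows. The function `apply_xml_parser`, its parameter names and defaults, the stand-in `flat_parse` helper, and the sample record values are all hypothetical. The sketch treats a record as a plain dict and illustrates only where the parsed data lands.

```python
import xml.etree.ElementTree as ET

def flat_parse(xml_text):
    """Stand-in parser: map each child element of the root to its text."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

def apply_xml_parser(record, xml_field, replace_field=True,
                     output_field=None, fail_if_missing=True):
    """Return a new record with the parsed data placed per the settings."""
    if xml_field not in record:
        if fail_if_missing:
            raise KeyError(f"field {xml_field!r} does not exist in the record")
        return dict(record)  # pass the record along unchanged
    if not replace_field and output_field is None:
        raise ValueError("output_field is required when replace_field is False")
    target = xml_field if replace_field else output_field
    return {**record, target: flat_parse(record[xml_field])}

record = {"order": "<root><ID>23005</ID></root>", "storeID": "S-100"}

replaced = apply_xml_parser(record, "order")
kept = apply_xml_parser(record, "order", replace_field=False, output_field="parsed")
skipped = apply_xml_parser({"storeID": "S-100"}, "order", fail_if_missing=False)

print(replaced)  # {'order': {'ID': '23005'}, 'storeID': 'S-100'}
print(kept["parsed"], kept["order"] == record["order"])  # {'ID': '23005'} True
print(skipped)   # {'storeID': 'S-100'}
```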

On the Schema tab, configure the following properties:


Schema Property	Description
Schema Mode	Mode that determines the schema to use when processing data. Infer from Data: the processor infers the field names and data types from the data. Use Custom Schema - JSON Format: the processor uses a custom schema defined in JSON format. Use Custom Schema - DDL Format: the processor uses a custom schema defined in DDL format. Note: To infer the schema, the processor requires that Apache Spark version 2.4.0 or later is installed on the Transformer machine and on each node in the cluster.
Schema	Custom schema to use to process the data. Enter the schema in JSON or DDL format, depending on the selected schema mode.
Error Handling	Determines how the stage handles errors when the field to parse contains invalid XML or when a field in the XML document is not compatible with the data type defined in the schema. Permissive: either writes the invalid XML to a `_corrupt_record` field or passes the incompatible field to the output record with a null value, then continues processing. Drop Malformed: drops the record from the pipeline and continues processing. Fail Fast: stops the pipeline with an error.