Whole Directory

The Whole Directory origin reads all files within the specified directory on HDFS or a local file system in a single batch. Every file must be fully written, include data of the same supported format, and use the same schema.

Important: The Whole Directory origin does not track offsets, so the origin reads all files in the directory each time that the pipeline runs. Use the Whole Directory origin only where this behavior is appropriate.

For example, you might use the Whole Directory origin in a batch pipeline where you want to reread a directory of files each time the pipeline runs. Or, you might use the origin in a slowly changing dimension pipeline that updates ungrouped file dimension data.

To read files using a more traditional origin, one that tracks offsets and allows caching, use the File origin.

The Whole Directory origin reads from HDFS using connection information stored in a Hadoop configuration file.

When you configure the Whole Directory origin, you specify the directory to read. You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.

You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.

Data Formats

The Whole Directory origin generates records based on the specified data format.

The origin can read the following data formats:

Avro: The origin generates a record for every Avro record in an Avro container file. Each file must contain the Avro schema. The origin uses the Avro schema to generate records. You can define an Avro schema to use. The schema must be in JSON format. You can also configure the origin to process all files in the specified locations. By default, the origin only processes files with the .avro extension.
Delimited: The origin generates a record for each line in a delimited file. You can specify a custom delimiter, quote, and escape character used in the data. By default, the origin uses the values in the first row for field names and creates records starting with the second row in the file. The origin infers data types from the data by default. You can clear the Includes Header property to indicate that files do not contain a header row. When files do not include a header row, the origin names the first field _c0, the second field _c1, and so on. The origin also infers data types from the data by default. You can rename the fields downstream with a Field Renamer processor, or you can specify a custom schema in the origin. When you specify a custom schema, the origin uses the field names and data types defined in the schema, applying the first field in the schema to the first field in the record, and so on. By default, when the origin encounters parsing errors, it stops the pipeline. When processing data with a custom schema, the origin handles parsing errors based on the configured error handling. Files must use \n as the newline character. The origin skips empty lines.
JSON: By default, the origin generates a record for each line in a JSON Lines file. Each line in the file should contain a valid JSON object. For details, see the JSON Lines website. If the JSON Lines file contains objects that span multiple lines, you must configure the origin to process multiline JSON objects. When processing multiline JSON objects, the origin generates a record for each JSON object, even if it spans multiple lines. A standard, single-line JSON Lines file can be split into partitions and processed in parallel. A multiline JSON file cannot be split, so it must be processed in a single partition, which can slow pipeline performance. By default, the origin uses the field names, field order, and data types in the data. When you specify a custom schema, the origin matches the field names in the schema to those in the data, then applies the data types and field order defined in the schema. By default, when the origin encounters parsing errors, it stops the pipeline. When processing data with a custom schema, the origin handles parsing errors based on the configured error handling.
ORC: The origin generates a record for each row in an Optimized Row Columnar (ORC) file.
Parquet: The origin generates a record for every Parquet record in the file. The file must contain the Parquet schema. The origin uses the Parquet schema to generate records.
Text: The origin generates a record for each line in a text file. The file must use \n as the newline character. The generated record consists of a single String field named Value that contains the data.
XML: The origin generates a record for every row defined in an XML file. You specify the root tag used in files and the row tag used to define records.
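
The field-naming rule for delimited data can be illustrated with a small sketch. This is not the origin's implementation; it is plain Python using the standard `csv` module, and the function name `read_delimited` and its parameters are hypothetical. It assumes all values stay strings (type inference is omitted) and only mirrors the documented behavior: header values become field names, header-less files get `_c0`, `_c1`, and so on, and empty lines are skipped.

```python
import csv


def read_delimited(text, includes_header=True, delimiter=","):
    """Return a list of dict records from delimited text (illustrative only)."""
    # Files are expected to use \n as the newline character; skip empty lines.
    lines = [line for line in text.split("\n") if line.strip() != ""]
    rows = list(csv.reader(lines, delimiter=delimiter))
    if not rows:
        return []
    if includes_header:
        # First row supplies field names; records start with the second row.
        names, data = rows[0], rows[1:]
    else:
        # No header row: name fields _c0, _c1, ... by position.
        names = [f"_c{i}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(names, row)) for row in data]


with_header = read_delimited("id,name\n1,a\n\n2,b\n")
without_header = read_delimited("1,a\n2,b", includes_header=False)
```

Here `with_header` uses `id` and `name` as field names, while `without_header` uses `_c0` and `_c1` for the same data.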

Configuring a Whole Directory Origin

Configure a Whole Directory origin to read all files within a directory on HDFS or the local file system in a single batch.

On the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.

On the File tab, configure the following properties:


File Property	Description
Directory Path	Path to the directory to read. To read from HDFS, use the following format: `hdfs://<authority>/<path>`. To read from a local file system, use the following format: `file:///<directory>`.
Additional Configuration	Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files. To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.

On the Data Format tab, configure the following properties:


Data Format Property	Description
Data Format	Format of the data. Select one of the following formats: Avro, Delimited, JSON, ORC, Parquet, Text, or XML.
Additional Data Format Configuration	Additional data format properties to use. Specify needed data format properties not available on configuration tabs. Properties on the configuration tabs override additional data format properties when there are conflicts. For example, for the JSON format you might add the property `allowUnquotedFieldNames` if the data has unquoted fields. With the property set to True, the origin can read JSON data with the following content: `{counter: 1}`. To add properties, click the Add icon and define the property name and value. You can use simple or bulk edit mode to configure the properties. Enter the property names and values expected by your version of Hadoop.

For Avro data, optionally configure the following properties:


Avro Property	Description
Avro Schema	Optional Avro schema to use to process data. The specified Avro schema overrides any schema included in the files. Specify the Avro schema in JSON format.
Ignore Extension	Processes all files in the specified directories. When not enabled, the origin only processes files with the `.avro` extension.

For delimited data, on the Data Format tab, optionally configure the following properties:


Delimited Property	Description
Delimiter Character	Delimiter character used in the data. Select one of the available options or select Other to enter a custom character. You can enter a Unicode control character using the format `\uNNNN`, where N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter `\u0000` to use the null character as the delimiter or `\u2028` to use a line separator as the delimiter.
Quote Character	Quote character used in the data.
Escape Character	Escape character used in the data.
Includes Header	Indicates that the data includes a header line. When selected, the origin uses the first line to create field names and begins reading with the second line.
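
The `\uNNNN` notation for the Delimiter Character property maps four hexadecimal digits to a single Unicode character. The following Python sketch shows that mapping only; the function name `parse_delimiter` is hypothetical, and accepting lowercase hex digits is an assumption of this sketch rather than documented behavior.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def parse_delimiter(value):
    """Turn a delimiter setting into the single character it represents."""
    match = _UNICODE_ESCAPE.fullmatch(value)
    if match:
        # Interpret NNNN as a hexadecimal code point.
        return chr(int(match.group(1), 16))
    if len(value) == 1:
        # An ordinary one-character delimiter is used as-is.
        return value
    raise ValueError(f"not a single character or \\uNNNN escape: {value!r}")


null_delimiter = parse_delimiter("\\u0000")   # the null character
line_separator = parse_delimiter("\\u2028")   # the Unicode line separator
pipe_delimiter = parse_delimiter("|")
```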

For JSON data, on the Data Format tab, configure the following property:


JSON Property	Description
Multiline	Enables processing multiline JSON Lines data. By default, the origin expects a single JSON object on each line of the file. Use this option to process JSON objects that span multiple lines.
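
The difference between the default mode and the Multiline option can be sketched in plain Python with the standard `json` module. This is an illustration of the two reading strategies, not the origin's code; the function names are hypothetical. The first function assumes one complete JSON object per line, and the second scans the whole text so that an object may span several lines.

```python
import json


def read_json_lines(text):
    """Default mode: every non-empty line must be a complete JSON object."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def read_multiline_json(text):
    """Multiline mode: objects may span lines, so scan the text as a whole."""
    decoder = json.JSONDecoder()
    records, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


single_line = '{"a": 1}\n{"a": 2}\n'
multi_line = '{\n  "a": 1\n}\n{\n  "a": 2\n}\n'

records_default = read_json_lines(single_line)
records_multiline = read_multiline_json(multi_line)
```

Because the line-based reader never needs to look past a newline, its input can be cut at line boundaries and handled in pieces, which is consistent with the note that only single-line files can be split into partitions. Passing `multi_line` to `read_json_lines` raises a `json.JSONDecodeError`, because the first line is not a complete object.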

For XML data, on the Data Format tab, configure the following properties:


XML Property	Description
Root Tag	Tag used as the root element. Default is ROWS, which represents a <ROWS> root element.
Row Tag	Tag used as a record delineator. Default is ROW, which represents a <ROW> record delineator element.
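
As a rough illustration of how the Root Tag and Row Tag properties relate to the file layout, the sketch below parses a small document with Python's standard `xml.etree.ElementTree`. The function name `read_xml_rows` is hypothetical, and treating each child element of a row as a string field is an assumption of this sketch, not a description of how the origin builds records.

```python
import xml.etree.ElementTree as ET


def read_xml_rows(text, root_tag="ROWS", row_tag="ROW"):
    """Return one dict per row element found directly under the root element."""
    root = ET.fromstring(text)
    if root.tag != root_tag:
        raise ValueError(f"expected root <{root_tag}>, found <{root.tag}>")
    # Each row element becomes a record; its children become fields.
    return [{child.tag: child.text for child in row} for row in root.findall(row_tag)]


sample = (
    "<ROWS>"
    "<ROW><id>1</id><name>a</name></ROW>"
    "<ROW><id>2</id><name>b</name></ROW>"
    "</ROWS>"
)
rows = read_xml_rows(sample)
```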

To use a custom schema for delimited or JSON data, click the Schema tab and configure the following properties:


Schema Property	Description
Schema Mode	Mode that determines the schema to use when processing data: Infer from Data: The origin infers the field names and data types from the data. Use Custom Schema - JSON Format: The origin uses a custom schema defined in the JSON format. Use Custom Schema - DDL Format: The origin uses a custom schema defined in the DDL format. Note that the schema is applied differently depending on the data format of the data.
Schema	Custom schema to use to process the data. Enter the schema in DDL or JSON format, depending on the selected schema mode.
Error Handling	Determines how the origin handles parsing errors: Permissive - When the origin encounters a problem parsing any field in the record, it creates a record with the field names defined in the schema, but with null values in every field. Drop Malformed - When the origin encounters a problem parsing any field in the record, it drops the entire record from the pipeline. Fail Fast - When the origin encounters a problem parsing any field in the record, it stops the pipeline.
Original Data Field	Field where the data from the original record is written when the origin cannot parse the record. When writing the original record to a field, you must add the field to the custom schema as a String field. Available when using permissive error handling.
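
The three error handling modes can be compared with a short Python sketch. It is a simplified model rather than the origin's implementation: the function name `parse_records`, the mode strings, and the schema representation (a list of name and converter pairs) are all hypothetical. It also assumes that the original data field is left empty for records that parse cleanly, which the property description does not state.

```python
def parse_records(lines, schema, mode="permissive", original_data_field=None):
    """Parse comma-separated lines against a schema of (name, converter) pairs."""
    # The original data field is part of the schema but is not read from the data.
    data_fields = [(n, c) for n, c in schema if n != original_data_field]
    records = []
    for line in lines:
        values = line.split(",")
        try:
            if len(values) != len(data_fields):
                raise ValueError("field count does not match the schema")
            record = {name: cast(v) for (name, cast), v in zip(data_fields, values)}
            if original_data_field:
                record[original_data_field] = None  # assumption: empty when parsing succeeds
        except ValueError as err:
            if mode == "fail_fast":
                # Fail Fast: stop processing on the first bad record.
                raise RuntimeError(f"cannot parse record: {line!r}") from err
            if mode == "drop_malformed":
                # Drop Malformed: discard the entire record.
                continue
            # Permissive: keep the schema's field names, with null in every field.
            record = {name: None for name, _ in schema}
            if original_data_field:
                record[original_data_field] = line
        records.append(record)
    return records


schema = [("id", int), ("name", str), ("_raw", str)]
lines = ["1,a", "x,b"]  # the second line has a non-integer id

permissive = parse_records(lines, schema, "permissive", original_data_field="_raw")
dropped = parse_records(lines, schema, "drop_malformed", original_data_field="_raw")
```

In this model, `permissive` keeps both records and stores `"x,b"` in `_raw` for the one that failed, while `dropped` contains only the first record.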