Whole Directory

The Whole Directory origin reads all files within the specified directory on HDFS or a local file system in a single batch. Every file must be fully written, contain data in the same supported format, and use the same schema.

Important: The Whole Directory origin does not track offsets, so the origin reads all files in the directory each time that the pipeline runs. Use the Whole Directory origin only where this behavior is appropriate.
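The effect of not tracking offsets can be illustrated with a minimal Python sketch. This is not the origin's implementation: the function `read_whole_directory` and the sample file names are hypothetical, and the sketch reads from a local temporary directory rather than HDFS. It only shows that a reader with no stored offsets returns every record on every run.

```python
import tempfile
from pathlib import Path


def read_whole_directory(directory):
    """Read every line of every file in `directory` as one batch.

    Hypothetical helper: nothing is remembered between calls,
    so each call starts from the first file again.
    """
    records = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            records.extend(path.read_text().splitlines())
    return records


def demo():
    """Simulate two pipeline runs over the same unchanged directory."""
    with tempfile.TemporaryDirectory() as tmp:
        # Two fully written files with the same format and schema.
        Path(tmp, "a.csv").write_text("1,alice\n2,bob\n")
        Path(tmp, "b.csv").write_text("3,carol\n")
        first_run = read_whole_directory(tmp)
        second_run = read_whole_directory(tmp)
        return first_run, second_run
```

Both runs in `demo()` return the same three records. An origin that tracks offsets would instead return nothing new on the second run, which is why rereading is only appropriate for pipelines designed around it.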

For example, you might use the Whole Directory origin in a batch pipeline where you want to reread a directory of files each time the pipeline runs. Or, you might use the origin in a slowly changing dimension pipeline that updates partitioned file dimension data.

To read files using a more traditional origin, one that tracks offsets and allows caching, use the File origin.

The Whole Directory origin reads from HDFS using connection information stored in a Hadoop configuration file.

When you configure the Whole Directory origin, you specify the directory to read. You select the data format and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.

You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.
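The override rule can be pictured as a simple merge in which values specified in the origin win over values from the Hadoop configuration file. The following Python sketch is only an illustration under that assumption: `effective_hdfs_properties` is a hypothetical helper, not part of the product, and the host names and values are made up. The property names `fs.defaultFS` and `dfs.replication` are standard Hadoop properties.

```python
def effective_hdfs_properties(file_properties, origin_properties):
    """Merge two property maps, letting origin-level values take precedence.

    Hypothetical helper for illustration only.
    """
    merged = dict(file_properties)    # start from the Hadoop configuration file
    merged.update(origin_properties)  # properties set in the origin override them
    return merged


# Values as they might appear in a Hadoop configuration file (made up).
from_file = {
    "fs.defaultFS": "hdfs://cluster-host:8020",
    "dfs.replication": "3",
}

# Property specified directly in the origin (made up).
from_origin = {
    "fs.defaultFS": "hdfs://override-host:8020",
}

effective = effective_hdfs_properties(from_file, from_origin)
```

In this sketch, `effective["fs.defaultFS"]` takes the value specified in the origin, while `dfs.replication` keeps the value from the configuration file because the origin does not redefine it.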