File
The File origin reads data from files in Hadoop Distributed File System (HDFS) or a local file system. Every file must be fully written, contain data in the same supported format, and use the same schema.
The File origin reads from HDFS using connection information stored in a Hadoop configuration file.
When reading multiple files in a batch, the origin reads the oldest file first. Upon successfully reading a file, the origin can delete the file, move it to an archive directory, or leave it in the directory.
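The ordering and post-processing behavior can be illustrated with a minimal Python sketch. It is not the origin's implementation: the function names `read_oldest_first` and `post_process` and the action strings are hypothetical, and the sketch assumes a local file system where "oldest" means the earliest last-modified time.

```python
import shutil
from pathlib import Path


def read_oldest_first(directory):
    """Return the files in `directory` ordered by last-modified time, oldest first."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def post_process(path, action, archive_dir=None):
    """Apply a post-processing action (hypothetical names) to a file that was read."""
    if action == "delete":
        path.unlink()
    elif action == "archive":
        # Create the archive directory if needed, then move the file into it.
        Path(archive_dir).mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(Path(archive_dir) / path.name))
    elif action != "none":
        raise ValueError(f"unknown action: {action}")
```

In this sketch, the action `"none"` leaves the file in place, which corresponds to leaving the file in the directory.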
When the pipeline stops, the origin notes the last-modified timestamp of the last file that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. When needed, you can reset pipeline offsets to process all available files.
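As a rough illustration of timestamp-based offsets, the following Python sketch stores the last-modified timestamp of the last processed file and uses it to decide which files to read on the next run. The JSON offset file, the function names, and the strict "newer than the offset" comparison are assumptions made for this example, not details of the origin itself.

```python
import json
from pathlib import Path


def load_offset(offset_file):
    """Return the saved last-modified timestamp, or None if no offset exists."""
    path = Path(offset_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_modified"]


def save_offset(offset_file, last_modified):
    """Persist the last-modified timestamp of the last processed file."""
    Path(offset_file).write_text(json.dumps({"last_modified": last_modified}))


def files_to_process(directory, offset):
    """Return files newer than the offset, oldest first; all files if offset is None."""
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    if offset is not None:
        files = [p for p in files if p.stat().st_mtime > offset]
    return sorted(files, key=lambda p: p.stat().st_mtime)


def reset_offset(offset_file):
    """Discard the saved offset so that the next run processes all available files."""
    Path(offset_file).unlink(missing_ok=True)
```

Calling `reset_offset` before a run mirrors resetting pipeline offsets: with no saved timestamp, every available file is selected again.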
When you configure the File origin, you specify the directory path to use and a name pattern for the files to read. The origin reads the files with matching names in the specified directory and its subdirectories. You can also configure a file name pattern for a subset of files to exclude from processing. You specify the data format, related data format properties, and how to handle files after they are read successfully. When needed, you can define a maximum number of files to read in a batch.
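The selection rules can be sketched in Python as follows. This is only an approximation under stated assumptions: the sketch uses glob-style patterns via `fnmatch`, which may differ from the pattern syntax the origin accepts, and the function and parameter names (`select_files`, `name_pattern`, `exclude_pattern`, `max_files`) are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def select_files(directory, name_pattern, exclude_pattern=None, max_files=None):
    """Pick files for one batch from `directory` and its subdirectories."""
    candidates = [
        p
        for p in Path(directory).rglob("*")  # recurse into subdirectories
        if p.is_file()
        and fnmatchcase(p.name, name_pattern)
        and not (exclude_pattern and fnmatchcase(p.name, exclude_pattern))
    ]
    # Oldest files first, then cap the batch size if a limit is set.
    candidates.sort(key=lambda p: p.stat().st_mtime)
    return candidates if max_files is None else candidates[:max_files]
```

For example, `select_files("/data/in", "sales_*.csv", exclude_pattern="*_tmp.csv", max_files=10)` would return at most ten matching files, skipping any whose names end in `_tmp.csv`. The directory path here is a placeholder.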
When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.
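The override rule amounts to a simple merge in which properties set on the origin win over properties from the Hadoop configuration file. The Python sketch below illustrates that precedence only. The function name `effective_hdfs_config` and the property values are hypothetical; `fs.defaultFS` and `dfs.replication` are standard Hadoop property names used here as sample keys.

```python
def effective_hdfs_config(file_properties, origin_properties):
    """Merge two property maps; values set on the origin take precedence."""
    merged = dict(file_properties)    # start from the Hadoop configuration file
    merged.update(origin_properties)  # origin-level properties override duplicates
    return merged


# Hypothetical values, for illustration only.
from_file = {"fs.defaultFS": "hdfs://namenode-a:8020", "dfs.replication": "3"}
from_origin = {"fs.defaultFS": "hdfs://namenode-b:8020"}

config = effective_hdfs_config(from_file, from_origin)
```

Here `config["fs.defaultFS"]` takes the value specified on the origin, while `dfs.replication` keeps the value from the configuration file because the origin does not override it.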
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Alternatively, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.