File
The File origin reads data from files in Hadoop Distributed File System (HDFS) or a local file system. Every file must be fully written, include data of the same supported format, and use the same schema.
The File origin reads from HDFS using connection information stored in a Hadoop configuration file.
When reading multiple files in a batch, the origin reads the oldest file first. Upon successfully reading a file, the origin can delete the file, move it to an archive directory, or leave it in the directory.
When the pipeline stops, the origin notes the last-modified timestamp of the last file that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. When needed, you can reset pipeline offsets to process all available files.
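The offset handling can be pictured with a short sketch. The following Python snippet is illustrative only and is not the origin's implementation; the directory path and offset value are hypothetical.

```python
from pathlib import Path

def files_to_process(directory, last_offset):
    """Illustrative only: list files modified after the stored offset, oldest first."""
    candidates = [
        (path.stat().st_mtime, path)
        for path in Path(directory).rglob("*")
        if path.is_file() and path.stat().st_mtime > last_offset
    ]
    candidates.sort()                        # oldest file first
    return [path for _, path in candidates]

# After a successful batch, the offset becomes the last-modified
# timestamp of the last file that was processed.
files = files_to_process("/data/incoming", last_offset=1700000000.0)
```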
When you configure the File origin, you specify the directory path to use and a name pattern for the files to read. The origin reads files with matching names in the specified directory and its subdirectories. You can also configure a name pattern for a subset of files to exclude from processing. If the origin reads partition files grouped by field, you must specify the partition base path so that the fields and field values are included in the data. You also specify how to handle successfully read files, and, when needed, you can define a maximum number of files to read in a batch.
You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
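For context, defining a custom schema for delimited data is similar to supplying an explicit schema to a Spark reader. The following PySpark sketch is an assumption-level illustration; the field names and path are hypothetical, and the File origin exposes equivalent settings through its configuration rather than in code.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("custom-schema-sketch").getOrCreate()

# Hypothetical schema for a delimited file with three columns.
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
])

df = (spark.read
      .format("csv")
      .schema(schema)               # custom schema instead of schema inference
      .option("header", "true")
      .load("hdfs:///data/customers/"))
```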
You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
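Conceptually, loading data once and caching it for reuse maps to caching a Spark DataFrame. A minimal sketch, assuming a hypothetical Parquet directory and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

df = spark.read.parquet("hdfs:///data/events/")   # hypothetical path
df.cache()                                         # keep the loaded data in memory

df.groupBy("status").count().show()      # first action populates the cache
df.select("user_id").distinct().show()   # later actions reuse the cached data
```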
Schema Requirement
All files processed by the File origin must have the same schema.
When files have different schemas, the resulting behavior depends on the data format and the version of Spark that you use. For example, the origin might skip processing delimited files that have a different schema, but add null values to records read from Parquet files that have a different schema.
Directory Path
When you configure the File origin, you specify the directory path to use. The origin reads all files in the specified directory and its subdirectories. You can use glob patterns in the directory path to specify a set of directories to read from.
In each batch, the origin reads any files added to the directory path since the last batch completed.
The format of the directory path depends on the file system that you want to read from:
- HDFS
- To read files in HDFS, use the following format for the directory path: hdfs://<path to directory>
- Local file system
- To read files in a local file system, use the following format for the directory path: file:///<path to directory>
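As a point of reference, the following PySpark sketch shows a glob pattern combined with the HDFS and local file system path formats described above; the paths are hypothetical, and the sketch is not how you configure the origin itself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("path-format-sketch").getOrCreate()

# Hypothetical paths: a glob pattern selects a set of directories to read from.
hdfs_df  = spark.read.json("hdfs:///sales/2023-*/")       # HDFS
local_df = spark.read.json("file:///data/sales/2023-*/")  # local file system
```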
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
- Delimited, JSON, text, or XML
- When reading text-based files from a local file system, Spark creates one partition for each file being read.
- Avro, ORC, or Parquet
- When reading Avro, ORC, or Parquet files, Spark can split each file into multiple partitions for processing.
Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
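To make this concrete, here is a hedged PySpark sketch that inspects the initial partition count and then repartitions the data, which is conceptually what the Repartition processor does for you in a pipeline; the path and partition count are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()

df = spark.read.parquet("hdfs:///data/orders/")   # hypothetical path

print(df.rdd.getNumPartitions())    # initial partitions chosen by Spark

repartitioned = df.repartition(8)   # what the Repartition processor does conceptually
print(repartitioned.rdd.getNumPartitions())
```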
Data Formats
The File origin generates records based on the specified data format.
- Avro
- The origin generates a record for every Avro record in an Avro container file. Each file must contain the Avro schema. The origin uses the Avro schema to generate records.
- Delimited
- The origin generates a record for each line in a delimited file. You can specify the custom delimiter, quote, and escape characters used in the data, as shown in the sketch after this list.
- JSON
- By default, the origin generates a record for each line in a JSON Lines file. Each line in the file should contain a valid JSON object. For details, see the JSON Lines website.
- ORC
- The origin generates a record for each row in an Optimized Row Columnar (ORC) file.
- Parquet
- The origin generates a record for every Parquet record in the file. The file must contain the Parquet schema. The origin uses the Parquet schema to generate records.
- Text
- The origin generates a record for each line in a text file. The file must use \n as the newline character.
- XML
- The origin generates a record for every row defined in an XML file. You specify the root tag used in files and the row tag used to define records.
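The data format properties above loosely correspond to Spark reader options. The following PySpark sketch is an assumption-level illustration with hypothetical paths and characters, not the origin's configuration interface.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-format-sketch").getOrCreate()

# Delimited data with custom delimiter, quote, and escape characters.
delimited_df = (spark.read
                .format("csv")
                .option("sep", "|")
                .option("quote", '"')
                .option("escape", "\\")
                .load("hdfs:///data/delimited/"))

# JSON Lines data: one JSON object per line.
json_df = spark.read.json("hdfs:///data/json/")
```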