ADLS Gen1
The origin connects to Azure Data Lake Storage Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication. When reading multiple files in a batch, the origin reads the oldest file first. Upon successfully reading a file, the origin can delete the file, move it to an archive directory, or leave it in the directory.
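The read order and post-processing options described above can be illustrated with a minimal local-filesystem sketch. It does not show how the origin is implemented. The function name `process_batch`, the `post_action` values, and the tie-breaking on file name are all assumptions made for this example.

```python
import shutil
from pathlib import Path


def process_batch(directory, handle, post_action="none", archive_dir=None):
    """Read files oldest first, then delete, archive, or leave each one.

    Hypothetical helper: `post_action` is "delete", "archive", or "none".
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    # Oldest file first, by last-modified time; the name breaks ties (assumption).
    files.sort(key=lambda p: (p.stat().st_mtime, p.name))

    processed = []
    for path in files:
        handle(path.read_text())
        if post_action == "delete":
            path.unlink()
        elif post_action == "archive":
            Path(archive_dir).mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(Path(archive_dir) / path.name))
        # Any other value leaves the file in the directory.
        processed.append(path.name)
    return processed
```

Calling `process_batch("/data/in", print, "archive", "/data/archive")` would print each file's contents in order of age and then move the file to the archive directory. The paths here are placeholders.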
When the pipeline stops, the origin notes the last-modified timestamp of the last file that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. When needed, you can reset pipeline offsets to process all available files.
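As a rough illustration of offset handling based on last-modified timestamps, the sketch below stores the timestamp in a small JSON file and uses it to decide which files still need processing. The file layout, the function names, and the strict "newer than the offset" comparison are assumptions for this example and do not describe the origin's internal storage.

```python
import json
from pathlib import Path


def load_offset(offset_file):
    """Return the saved last-modified timestamp, or None if no offset exists."""
    path = Path(offset_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_modified"]


def save_offset(offset_file, last_modified):
    """Store the timestamp of the last processed file as the offset."""
    Path(offset_file).write_text(json.dumps({"last_modified": last_modified}))


def reset_offset(offset_file):
    """Forget the offset so the next run processes all available files."""
    Path(offset_file).unlink(missing_ok=True)


def files_after_offset(files, offset):
    """Keep (name, last_modified) pairs newer than the offset, oldest first."""
    pending = [f for f in files if offset is None or f[1] > offset]
    return sorted(pending, key=lambda f: f[1])
```

With no saved offset, `files_after_offset` returns every file. After `save_offset(path, 20)`, only files modified after timestamp 20 remain, and `reset_offset(path)` restores the initial behavior.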
Before you use the origin, you must perform some prerequisite tasks.
When you configure the ADLS Gen1 origin, you specify the service name and Azure authentication information, such as the application ID and key. Alternatively, you can have the origin use the Azure authentication information configured in the cluster where the pipeline runs.
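For orientation, cluster-level service principal credentials for ADLS Gen1 are commonly expressed as Hadoop configuration properties. The sketch below builds such a property map using the property names from the Hadoop Azure Data Lake connector. Whether the origin reads exactly these properties is an assumption, the helper name `adls_gen1_oauth_conf` is hypothetical, and all values are placeholders.

```python
def adls_gen1_oauth_conf(application_id, application_key, token_endpoint):
    """Build Hadoop-style OAuth2 properties for service principal access.

    Property names follow the Hadoop Azure Data Lake connector; mapping the
    origin's authentication settings onto them is an assumption.
    """
    return {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.client.id": application_id,
        "fs.adl.oauth2.credential": application_key,
        "fs.adl.oauth2.refresh.url": token_endpoint,
    }


# Placeholder values only; substitute the details of your own service principal.
conf = adls_gen1_oauth_conf(
    application_id="<application-id>",
    application_key="<application-key>",
    token_endpoint="https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)
```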
You configure the directory path to use and a name pattern for the files to read. The origin reads the files with matching names in the specified directory and its subdirectories. You can also configure a file name pattern for a subset of files to exclude from processing. You specify how to process successfully read files and, when needed, a maximum number of files to read in a batch.
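The interaction between the file name pattern, the exclusion pattern, and the per-batch file limit can be sketched as follows. This is an illustration on a local filesystem only. It assumes glob-style patterns matched against the file name, and the function `select_files` and its parameters are hypothetical.

```python
import fnmatch
import os


def select_files(root, include_pattern, exclude_pattern=None, max_files=None):
    """Pick files under `root` (including subdirectories) for one batch."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Keep only names that match the include pattern.
            if not fnmatch.fnmatchcase(name, include_pattern):
                continue
            # Drop the subset of names that match the exclusion pattern.
            if exclude_pattern and fnmatch.fnmatchcase(name, exclude_pattern):
                continue
            matches.append(os.path.join(dirpath, name))

    # Oldest file first; the path breaks ties (assumption).
    matches.sort(key=lambda p: (os.path.getmtime(p), p))
    # Apply the optional limit on the number of files per batch.
    return matches if max_files is None else matches[:max_files]
```

For example, `select_files(root, "*.csv", "*_tmp*", max_files=10)` would return at most ten CSV files whose names do not contain `_tmp`, ordered from oldest to newest.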
You select the data format and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
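To show what applying a custom schema to delimited data means in practice, here is a small standard-library sketch. In the origin, the schema is defined through stage properties. The `(name, type)` list used here is a hypothetical stand-in for that configuration, and `read_with_schema` is not part of any product API.

```python
import csv
import io


def read_with_schema(text, schema, delimiter=","):
    """Parse header-less delimited text into typed records.

    `schema` is a list of (field_name, converter) pairs, for example
    [("id", int), ("name", str)]. This format is an assumption.
    """
    records = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) != len(schema):
            raise ValueError(f"expected {len(schema)} fields, got {len(row)}")
        # Convert each raw string with the converter declared for its field.
        records.append(
            {name: convert(value) for (name, convert), value in zip(schema, row)}
        )
    return records


rows = read_with_schema(
    "1,widget,2.5\n2,gadget,4\n",
    [("id", int), ("name", str), ("price", float)],
)
```

Without a schema, every field would remain a string. With the schema, `id` is read as an integer and `price` as a floating-point number.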
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Alternatively, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.