ADLS Gen2

The ADLS Gen2 origin reads data from Microsoft Azure Data Lake Storage Gen2. Every file must be fully written, include data of the same supported format, and use the same schema. To read from Azure Data Lake Storage Gen1, use the ADLS Gen1 origin.

Note: When this stage is included in a pipeline that runs on an Azure HDInsight cluster, the cluster must use HDInsight version 4.0 or later.

When reading multiple files in a batch, the origin reads the oldest file first. Upon successfully reading a file, the origin can delete the file, move it to an archive directory, or leave it in the directory.
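The following sketch illustrates the oldest-first ordering and the three post-processing choices in plain Python, using a local directory as a stand-in for an ADLS Gen2 path. The `process_file` function, the directory arguments, and the post-processing option names are hypothetical placeholders, not part of the origin.

```python
# Conceptual sketch only: local files stand in for objects in an ADLS Gen2
# directory; process_file and the option names below are hypothetical.
import shutil
from pathlib import Path


def process_file(path: Path) -> None:
    print(f"processing {path.name}")  # placeholder for reading the file's records


def read_batch(source_dir: Path, post_processing: str, archive_dir: Path) -> None:
    # Oldest file first: order the candidates by last-modified timestamp.
    candidates = sorted(
        (p for p in source_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    for path in candidates:
        process_file(path)
        # After a file is read successfully, apply the configured action.
        if post_processing == "delete":
            path.unlink()
        elif post_processing == "archive":
            archive_dir.mkdir(exist_ok=True)
            shutil.move(str(path), str(archive_dir / path.name))
        # Otherwise, leave the file in the directory.
```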

When the pipeline stops, the origin notes the last-modified timestamp of the last file that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. When needed, you can reset pipeline offsets to process all available files.
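A minimal sketch of how a timestamp-based offset behaves, assuming files are identified by name and last-modified time. The `files_to_process` helper is hypothetical and only illustrates the selection logic.

```python
# Hypothetical illustration of timestamp-based offset tracking.
from typing import Iterable, List, Optional, Tuple


def files_to_process(
    files: Iterable[Tuple[str, float]],  # (file name, last-modified epoch seconds)
    saved_offset: Optional[float],
) -> List[Tuple[str, float]]:
    """Return the files still to be read, oldest first."""
    if saved_offset is None:
        # No stored offset (first run, or offsets were reset): read everything.
        return sorted(files, key=lambda f: f[1])
    # Continue from the last-saved offset: only files modified after it.
    return sorted((f for f in files if f[1] > saved_offset), key=lambda f: f[1])


# With a saved offset of 100.0, only the newer file is picked up again.
print(files_to_process([("a.csv", 50.0), ("b.csv", 150.0)], 100.0))
```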

Before you use the origin, you must perform some prerequisite tasks.

When you configure the ADLS Gen2 origin, you specify the Azure authentication method to use and related properties. Or, you can have the origin use Azure authentication information configured in the cluster where the pipeline runs.
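For reference, the sketch below shows two common Azure authentication approaches, OAuth with a service principal and a shared account key, using the azure-storage-file-datalake and azure-identity Python SDKs. The origin takes the equivalent values as stage properties; this code only illustrates the underlying Azure methods, and every account, tenant, client, and key value is a placeholder.

```python
# Illustration of Azure authentication methods; the origin itself takes these
# values as stage properties rather than running code like this.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder

# Option 1: OAuth with a service principal (tenant ID, application ID, secret).
oauth_client = DataLakeServiceClient(
    account_url=ACCOUNT_URL,
    credential=ClientSecretCredential(
        tenant_id="<tenant-id>",
        client_id="<application-id>",
        client_secret="<application-secret>",
    ),
)

# Option 2: shared key authentication using the storage account access key.
shared_key_client = DataLakeServiceClient(
    account_url=ACCOUNT_URL,
    credential="<account-key>",
)
```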

You configure the directory path to use and a name pattern for the files to read. The origin reads files with matching names in the specified directory and its subdirectories. You can also configure a file name pattern for a subset of files to exclude from processing, and you specify how to process successfully read files. When needed, you can define a maximum number of files to read in a batch.
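As an illustration of how include and exclude name patterns interact, here is a small glob-style example using Python's fnmatch. The patterns and file names are invented, and the origin's exact pattern syntax may differ in detail.

```python
# Sketch of glob-style name matching; patterns and file names are examples only.
from fnmatch import fnmatch

files = ["orders_2023.csv", "orders_2024.csv", "orders_2024.tmp", "summary.json"]

name_pattern = "orders_*.csv"   # files to read
exclude_pattern = "*_2023*"     # subset of matching files to skip

selected = [
    f for f in files
    if fnmatch(f, name_pattern) and not fnmatch(f, exclude_pattern)
]
print(selected)  # ['orders_2024.csv']
```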

You select the data format of the data and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.
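To show what a custom schema accomplishes, the hedged Spark example below reads delimited data with an explicit DDL schema instead of relying on inference. The abfss path, column names, and types are placeholders; the origin's schema properties are configured in the stage, not in code like this.

```python
# Hedged sketch: explicit schema for delimited data in Spark; all names and the
# abfss path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-schema-sketch").getOrCreate()

# Without a schema, column types would be inferred (often as strings).
df = (
    spark.read
    .schema("order_id INT, customer STRING, total DOUBLE, ordered_at TIMESTAMP")
    .option("header", "true")
    .csv("abfss://<container>@<account>.dfs.core.windows.net/orders/*.csv")
)
df.printSchema()
```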

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.