Azure Data Lake Storage Gen2 (Legacy)

The origin uses the Hadoop FileSystem interface to read data from Microsoft Azure Data Lake Storage Gen2. The origin can create multiple threads to enable parallel processing in a multithreaded pipeline. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Tip: Data Collector provides several Azure storage origins to address different needs. For a quick comparison chart to help you choose the right one, see the Azure storage origins comparison in the Data Collector documentation. For all new development, use one of the other Azure storage origins, which provide better performance.

The files to be processed must all share a file name pattern and be fully written. Use the origin only in pipelines configured for standalone execution mode.

When you configure the origin, you define the directory to use, the read order, the file name pattern, the file name pattern mode, and the first file to process. You can use glob patterns or regular expressions to define the file name pattern that you want to use.
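The difference between the two file name pattern modes can be sketched in Python (the file names and patterns here are illustrative, not part of the product):

```python
import fnmatch
import re

files = ["sales-2023-01.csv", "sales-2023-02.csv", "inventory.csv"]

# Glob mode: shell-style wildcards, where * matches any run of characters.
glob_matches = [f for f in files if fnmatch.fnmatch(f, "sales-*.csv")]

# Regular expression mode: full regex syntax for finer control,
# such as requiring a YYYY-MM suffix.
pattern = re.compile(r"sales-\d{4}-\d{2}\.csv")
regex_matches = [f for f in files if pattern.fullmatch(f)]

print(glob_matches)
print(regex_matches)
```

Glob patterns are usually enough when files share a simple prefix and extension; regular expressions help when the pattern must constrain the variable part of the name.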

You can configure the origin to read from subdirectories when the origin reads files by last modified timestamp. To use multiple threads for processing, specify the number of threads to use.
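A minimal sketch of this behavior, assuming a local directory tree stands in for the storage account (the function names are illustrative, not Data Collector's implementation): files from the directory and its subdirectories are ordered by last modified timestamp, then handed to a pool of worker threads.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_files_by_mtime(root):
    """Walk root and its subdirectories, returning paths ordered
    by last modified timestamp (oldest first)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in filenames)
    return sorted(paths, key=os.path.getmtime)

def process(path):
    """Placeholder for reading and parsing one file."""
    return path

def run(root, num_threads=4):
    """Process files in timestamp order using num_threads workers."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(process, list_files_by_mtime(root)))
```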

You can also enable the origin to read compressed files. After processing a file, the origin can keep, archive, or delete the file.
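The keep, archive, and delete post-processing options can be sketched as follows (the post_process function and its action names are illustrative, not the product's API):

```python
import os
import shutil

def post_process(path, action, archive_dir=None):
    """Apply a post-processing action after a file is fully read."""
    if action == "keep":
        return path  # leave the file in place
    if action == "archive":
        os.makedirs(archive_dir, exist_ok=True)
        # move the file to the archive directory for later inspection
        return shutil.move(path, os.path.join(archive_dir, os.path.basename(path)))
    if action == "delete":
        os.remove(path)
        return None
    raise ValueError(f"unknown action: {action}")
```

Archiving is the safer choice during development, since processed files remain available if the pipeline needs to be rerun.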

When a pipeline stops, the origin notes where it stops reading. When the pipeline starts again, the origin continues processing from where it stopped by default. You can reset the origin to process all requested files.
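The offset-tracking behavior can be sketched with a small persisted record of the last position read (the file name and functions here are hypothetical, not how Data Collector stores offsets):

```python
import json
import os

OFFSET_FILE = "origin-offset.json"  # hypothetical offset store

def save_offset(file_name, position):
    """Record the last file and byte position processed."""
    with open(OFFSET_FILE, "w") as f:
        json.dump({"file": file_name, "position": position}, f)

def load_offset():
    """Return the saved offset, or None to start from the first file."""
    if not os.path.exists(OFFSET_FILE):
        return None
    with open(OFFSET_FILE) as f:
        return json.load(f)

def reset_origin():
    """Discard the saved offset so all requested files are processed again."""
    if os.path.exists(OFFSET_FILE):
        os.remove(OFFSET_FILE)
```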

Note: The origin processes files based on file names and locations. Files with the same name in different locations can cause the origin to skip reading the duplicates.

The origin generates record header attributes that enable you to use information about the origins of a record in processing.
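As a sketch of how such attributes might be used downstream (the attribute names and helper functions are illustrative, not the origin's actual header attribute set):

```python
def make_record(value, file_path, offset):
    """Pair a record's value with header attributes describing its origin,
    such as the source file path and read offset."""
    return {
        "header": {"file": file_path, "offset": offset},
        "value": value,
    }

def route_by_source(record):
    """Use the origin attributes to make a processing decision,
    e.g. routing records from sales files to a dedicated branch."""
    return "sales" if "sales" in record["header"]["file"] else "other"
```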

You can also use a connection to configure the origin.

The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.