Delta Lake

The Delta Lake origin reads data from a Delta Lake table. The origin can read from a managed or unmanaged table.

The origin can be used only in batch pipelines and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. To process a Delta Lake managed table in streaming mode, or in batch mode while tracking offsets, use the Hive origin. Note, however, that the Hive origin cannot process unmanaged tables.

Important: The Delta Lake origin requires that Apache Spark version 2.4.2 or later be installed on the Transformer machine and on each node in the cluster.

When you configure the Delta Lake origin, you specify the path to the table to read. You can optionally enable time travel to query older versions of the table.
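
Because Transformer pipelines run on Apache Spark, the origin's read is conceptually equivalent to a Delta Lake batch read in Spark. The following PySpark sketch illustrates the idea, including time travel; the table path /delta/orders and the version and timestamp values are hypothetical examples, not values the origin requires:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-read-sketch").getOrCreate()

# Read the current version of the table at a hypothetical path.
df = spark.read.format("delta").load("/delta/orders")

# Time travel: read an older version of the table by version number ...
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/delta/orders")

# ... or by timestamp.
df_jan = (spark.read.format("delta")
          .option("timestampAsOf", "2020-01-01")
          .load("/delta/orders"))
```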

You configure the storage system for the table. When reading from a table stored on Azure Data Lake Storage (ADLS) Gen1 or ADLS Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.
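
As an illustration of the kind of connection details involved, the following sketch sets storage credentials as Spark Hadoop properties and reads a table by its storage-specific URI. All account, bucket, and key values are hypothetical placeholders; in Transformer you supply the equivalent details through the origin and cluster configuration rather than in code:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("storage-config-sketch")
         # ADLS Gen2: shared-key auth for a hypothetical storage account.
         .config("spark.hadoop.fs.azure.account.key.myaccount.dfs.core.windows.net",
                 "<storage-account-key>")
         # Amazon S3: credentials for the s3a connector.
         .config("spark.hadoop.fs.s3a.access.key", "<aws-access-key>")
         .config("spark.hadoop.fs.s3a.secret.key", "<aws-secret-key>")
         .getOrCreate())

# The table path uses the scheme for the storage system, for example:
#   abfss://container@myaccount.dfs.core.windows.net/delta/orders   (ADLS Gen2)
#   s3a://my-bucket/delta/orders                                    (Amazon S3)
#   hdfs:///delta/orders                                            (HDFS)
df = spark.read.format("delta").load("s3a://my-bucket/delta/orders")
```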

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently.
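
In Spark terms, the load-once behavior is similar to caching a DataFrame so that downstream actions reuse the loaded data instead of rescanning storage. A minimal sketch, again with a hypothetical table path and an assumed amount column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Load once and cache; the path and the 'amount' column are hypothetical.
df = spark.read.format("delta").load("/delta/orders")
df.cache()

# Both actions below reuse the cached data rather than re-reading the table.
total_rows = df.count()
large_orders = df.filter(df["amount"] > 100).count()
```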

To access a table stored on ADLS Gen1 or ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete the additional prerequisite tasks.