Delta Lake

The Delta Lake origin reads data from a Delta Lake table. The origin can read from a managed or unmanaged table.

The origin can be used only in batch pipelines and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. To process a Delta Lake managed table in streaming mode, or in batch mode while tracking offsets, use the Hive origin. Note, however, that the Hive origin cannot process unmanaged tables.

Important: The Delta Lake origin requires that Apache Spark version 2.4.2 or later be installed on the Transformer machine and on each node in the cluster.

When you configure the Delta Lake origin, you specify the path to the table to read. You can optionally enable time travel to query older versions of the table.
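
Because Transformer pipelines run on Apache Spark, the origin's read is conceptually equivalent to a Delta Lake batch read in Spark. The following PySpark sketch illustrates the idea, including time travel; the table path /delta/orders and the version and timestamp values are hypothetical examples, not values the origin requires:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-read-sketch").getOrCreate()

# Read the current version of the table at a hypothetical path.
df = spark.read.format("delta").load("/delta/orders")

# Time travel: read an older version of the table by version number ...
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/delta/orders")

# ... or by timestamp.
df_jan = (spark.read.format("delta")
          .option("timestampAsOf", "2020-01-01")
          .load("/delta/orders"))
```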

You configure the storage system for the table. When reading from a table stored on Azure Data Lake Storage (ADLS) Gen1 or ADLS Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.
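
As an illustration of the kind of connection details involved, the following sketch sets storage credentials as Spark Hadoop properties and reads a table by its storage-specific URI. All account, bucket, and key values are hypothetical placeholders; in Transformer you supply the equivalent details through the origin and cluster configuration rather than in code:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("storage-config-sketch")
         # ADLS Gen2: shared-key auth for a hypothetical storage account.
         .config("spark.hadoop.fs.azure.account.key.myaccount.dfs.core.windows.net",
                 "<storage-account-key>")
         # Amazon S3: credentials for the s3a connector.
         .config("spark.hadoop.fs.s3a.access.key", "<aws-access-key>")
         .config("spark.hadoop.fs.s3a.secret.key", "<aws-secret-key>")
         .getOrCreate())

# The table path uses the scheme for the storage system, for example:
#   abfss://container@myaccount.dfs.core.windows.net/delta/orders   (ADLS Gen2)
#   s3a://my-bucket/delta/orders                                    (Amazon S3)
#   hdfs:///delta/orders                                            (HDFS)
df = spark.read.format("delta").load("s3a://my-bucket/delta/orders")
```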

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently.
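
In Spark terms, the load-once behavior is similar to caching a DataFrame so that downstream actions reuse the loaded data instead of rescanning storage. A minimal sketch, again with a hypothetical table path and an assumed amount column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Load once and cache; the path and the 'amount' column are hypothetical.
df = spark.read.format("delta").load("/delta/orders")
df.cache()

# Both actions below reuse the cached data rather than re-reading the table.
total_rows = df.count()
large_orders = df.filter(df["amount"] > 100).count()
```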

To access a table stored on ADLS Gen1 or ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete the additional prerequisite tasks.