Unity Catalog

The Unity Catalog origin reads data from a Databricks Unity Catalog managed table. Use the origin only in Databricks cluster pipelines.

The origin can perform a full or incremental read. By default, the origin performs a full read of all available data.

When you configure the origin, you specify the catalog, schema, and table to read from. If you configure the origin to perform an incremental read, you specify the offset column and initial offset to use.
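Unity Catalog addresses a table through a three-level name of the form catalog.schema.table. The following minimal sketch shows how the three values could be combined into one backtick-quoted name. The function names and the sample values main, sales, and orders are hypothetical, and the sketch does not show how the origin itself builds the name.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def full_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into a three-level table name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Hypothetical values, used only for illustration.
print(full_table_name("main", "sales", "orders"))  # `main`.`sales`.`orders`
```

Quoting each part separately keeps names that contain spaces or special characters from being misread as additional name levels.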

Full or Incremental Read

The Unity Catalog origin can perform a full read or an incremental read each time you run the pipeline. By default, the origin performs a full read of the specified table.

When the origin performs a full read, the origin processes all data available in the table each time that the pipeline runs.

When the origin performs an incremental read, the first pipeline run is the same as a full read. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin reads the table starting from the last-saved offset, unless you reset the pipeline offsets.
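The offset behavior described above can be sketched with a small in-memory simulation. This is an illustration only, not the origin's implementation: the run_pipeline function is hypothetical, the table is a plain Python list, and the sketch assumes that a row is read when its offset value is strictly greater than the starting offset.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, int]


def run_pipeline(
    table: List[Row],
    offset_column: str,
    initial_offset: int,
    saved_offset: Optional[int],
) -> Tuple[List[Row], Optional[int]]:
    """Simulate one pipeline run and return (rows read, offset to save)."""
    # Use the initial offset when no offset is saved: first run or after a reset.
    start = initial_offset if saved_offset is None else saved_offset
    # Assumption: rows strictly greater than the starting offset are read.
    batch = sorted(
        (row for row in table if row[offset_column] > start),
        key=lambda row: row[offset_column],
    )
    new_offset = batch[-1][offset_column] if batch else saved_offset
    return batch, new_offset


table = [{"id": 1}, {"id": 2}, {"id": 3}]

# First run: with an initial offset below every value, all rows are read.
rows, saved = run_pipeline(table, "id", 0, None)
print(len(rows), saved)  # 3 3

# Later run: only rows added since the last-saved offset are read.
table.append({"id": 4})
rows, saved = run_pipeline(table, "id", 0, saved)
print(rows, saved)  # [{'id': 4}] 4

# Resetting the offsets discards the saved value, so the next run starts over.
rows, saved = run_pipeline(table, "id", 0, None)
print(len(rows), saved)  # 4 4
```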

When you configure the origin to perform an incremental read, you specify the offset column and initial offset to use. As a best practice, choose an offset column that contains unique, incrementing values and no null values. Indexing this column is strongly encouraged because the underlying query uses an ORDER BY clause and inequality operators on the column.
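To show why the offset column matters, the following sketch runs a query of the general shape described above: an inequality filter and an ORDER BY clause on the offset column. The exact query that the origin generates is not shown in this documentation, so the query text is an assumption. The sketch uses the Python sqlite3 module as a local stand-in for Databricks, and the orders table and its columns are hypothetical.

```python
import sqlite3


def build_incremental_query(table: str, offset_column: str) -> str:
    """Return a query that filters and sorts on the offset column.

    Assumed shape only. The origin's actual query may differ.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {offset_column} > ? "
        f"ORDER BY {offset_column}"
    )


# Local in-memory database standing in for the real table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(3, 30.0), (1, 10.0), (2, 20.0)],
)

query = build_incremental_query("orders", "id")
rows = conn.execute(query, (1,)).fetchall()  # last-saved offset is 1
print(rows)  # [(2, 20.0), (3, 30.0)]
conn.close()
```

Because both the filter and the sort use the same column, duplicate or null values in that column would make the saved offset ambiguous, which is why a unique, non-null column is recommended.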

Configuring a Unity Catalog Origin

Configure a Unity Catalog origin to read from a Databricks Unity Catalog managed table. Use the origin only in Databricks cluster pipelines.

On the Properties panel, on the General tab, configure the following properties:


General Property	Description
Name	Stage name.
Description	Optional description.
Load Data Only Once	Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run. Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch. Do not select this property when you configure the origin to read from more than one table.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode. Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run.

On the Unity Catalog tab, configure the following properties:


Unity Catalog Property	Description
Catalog Name	Catalog for the table to read.
Schema Name	Schema for the table to read.
Table Name	Table to read.
Incremental Mode	Enables the origin to read data from a specified initial offset for the first pipeline run, and then from the last-saved offset during subsequent pipeline runs.
Initial Offset	Initial offset value to use when you start the pipeline. Available only for incremental reads.
Offset Column	Column to track the progress of the read. Available only for incremental reads.
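The dependencies between the Unity Catalog properties can be summarized as a small validation sketch. The class and field names below are hypothetical and do not correspond to a real configuration API. The sketch also assumes that both offset properties are required when Incremental Mode is enabled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UnityCatalogOriginConfig:
    """Hypothetical model of the properties on the Unity Catalog tab."""

    catalog_name: str
    schema_name: str
    table_name: str
    incremental_mode: bool = False
    initial_offset: Optional[str] = None
    offset_column: Optional[str] = None

    def problems(self) -> List[str]:
        """Return a list of configuration problems. Empty means valid."""
        issues = []
        for label, value in (
            ("Catalog Name", self.catalog_name),
            ("Schema Name", self.schema_name),
            ("Table Name", self.table_name),
        ):
            if not value.strip():
                issues.append(f"{label} is required")
        # Assumption: offset properties are required only for incremental reads.
        if self.incremental_mode:
            if not self.offset_column:
                issues.append("Offset Column is required for incremental reads")
            if self.initial_offset is None:
                issues.append("Initial Offset is required for incremental reads")
        return issues


full_read = UnityCatalogOriginConfig("main", "sales", "orders")
print(full_read.problems())  # []

incomplete = UnityCatalogOriginConfig("main", "sales", "orders", incremental_mode=True)
print(incomplete.problems())
```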