Unity Catalog

The Unity Catalog origin reads data from a Databricks Unity Catalog managed table. Use the origin only in Databricks cluster pipelines.

The origin can perform a full or incremental read. By default, the origin performs a full read of all available data.

When you configure the origin, you specify the catalog, schema, and table to read from. If you configure the origin to perform an incremental read, you specify the offset column and initial offset to use.
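The catalog, schema, and table together identify one table in Unity Catalog's three-level namespace. As a minimal sketch, the hypothetical helper below (not part of the origin) shows how the three configured values might combine into a fully qualified name. The backtick quoting and the sample names `main`, `sales`, and `orders` are assumptions made for illustration.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table reference (hypothetical helper)."""
    parts = [catalog, schema, table]
    for part in parts:
        if not part or not part.strip():
            raise ValueError("catalog, schema, and table must all be non-empty")
    # Backtick-quote each part and escape embedded backticks by doubling them.
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# Sample values are placeholders, not names from any real workspace.
print(qualified_table_name("main", "sales", "orders"))
```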

Full or Incremental Read

The Unity Catalog origin can perform a full read or an incremental read each time you run the pipeline. By default, the origin performs a full read of the specified table.

When the origin performs a full read, the origin processes all data available in the table each time that the pipeline runs.

When the origin performs an incremental read, the first pipeline run is the same as a full read. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin reads the table starting from the last-saved offset, unless you reset the pipeline offsets.
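The difference between the two read modes can be illustrated with a toy model. This is only a sketch under stated assumptions, not the origin's actual implementation: the class `OriginSketch`, its methods, and the in-memory `TABLE` are hypothetical, and the sketch assumes that rows equal to the stored offset are not read again.

```python
from typing import Optional

# Stand-in for the table: (id, value) rows, where id plays the offset column.
TABLE = [(1, "a"), (2, "b"), (3, "c")]


class OriginSketch:
    """Toy model of full and incremental reads (hypothetical)."""

    def __init__(self, incremental: bool, initial_offset: int = 0):
        self.incremental = incremental
        self.initial_offset = initial_offset
        self.saved_offset: Optional[int] = None  # kept between pipeline runs

    def run(self, table):
        if not self.incremental:
            return list(table)  # full read: every row, on every run
        # Use the initial offset until an offset has been saved.
        start = self.initial_offset if self.saved_offset is None else self.saved_offset
        # Assumption: the boundary is exclusive, so only later rows are read.
        rows = sorted(r for r in table if r[0] > start)
        if rows:
            self.saved_offset = rows[-1][0]  # remember where this run stopped
        return rows

    def reset_offset(self):
        """Model resetting the pipeline offsets."""
        self.saved_offset = None


table = list(TABLE)
origin = OriginSketch(incremental=True)
first_run = origin.run(table)    # same rows as a full read
table.append((4, "d"))
second_run = origin.run(table)   # only rows past the saved offset
origin.reset_offset()
third_run = origin.run(table)    # starts again from the initial offset
print(len(first_run), len(second_run), len(third_run))
```

With `incremental=False`, every call to `run` returns the whole table, which matches the default full-read behavior described above.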

When you configure the origin to perform an incremental read, you specify the offset column and initial offset to use. As a best practice, choose an offset column that is unique, incrementing, and free of null values. An index on this column is strongly encouraged because the underlying query uses an ORDER BY clause and inequality operators on the column.
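The exact query that the origin generates is not shown here, but its general shape (an inequality on the offset column combined with an ORDER BY clause) might look like the following sketch. SQLite stands in for the Databricks table so that the example runs anywhere. The table `orders`, the offset column `order_id`, and the function `read_after` are all hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (3, 4.25), (4, 12.0)],
)


def read_after(conn, last_offset):
    """Return rows whose offset column is past last_offset, in offset order."""
    cursor = conn.execute(
        "SELECT order_id, amount FROM orders "
        "WHERE order_id > ? ORDER BY order_id",
        (last_offset,),
    )
    return cursor.fetchall()


rows = read_after(conn, 2)
# The highest offset value read becomes the offset to store for the next run.
new_offset = rows[-1][0] if rows else 2
print(rows, new_offset)
```

The sketch also suggests why the best practices matter. In standard SQL, a null value in the offset column never satisfies the inequality, so such a row would not be selected, and duplicate offset values at the boundary could be skipped on the next run.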

Configuring a Unity Catalog Origin

Configure a Unity Catalog origin to read from a Databricks Unity Catalog managed table. Use the origin only in Databricks cluster pipelines.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property: Description
    Name: Stage name.
    Description: Optional description.
    Load Data Only Once: Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run.

      Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.

      Do not select this property when you configure the origin to read from more than one table.

    Cache Data: Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

      Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

      Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run.

  2. On the Unity Catalog tab, configure the following properties:
    Unity Catalog Property: Description
    Catalog Name: Catalog for the table to read.
    Schema Name: Schema for the table to read.
    Table Name: Table to read.
    Incremental Mode: Enables the origin to read data from a specified initial offset for the first pipeline run, and then from the last-saved offset during subsequent pipeline runs.
    Initial Offset: Initial offset value to use when you start the pipeline.

      Available only for incremental reads.

    Offset Column: Column to track the progress of the read.

      Available only for incremental reads.
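The relationship between the Unity Catalog tab properties can be summarized in a short sketch. The `UnityCatalogOriginConfig` class below is a hypothetical container that mirrors the properties above. It is not an API of the product, and the field names, sample values, and validation rules are assumptions derived from the property descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UnityCatalogOriginConfig:
    """Hypothetical container mirroring the Unity Catalog tab properties."""

    catalog_name: str
    schema_name: str
    table_name: str
    incremental_mode: bool = False
    initial_offset: Optional[str] = None
    offset_column: Optional[str] = None

    def validate(self) -> None:
        # Catalog, schema, and table are always needed to locate the table.
        for field_name in ("catalog_name", "schema_name", "table_name"):
            if not getattr(self, field_name):
                raise ValueError(f"{field_name} is required")
        if self.incremental_mode:
            # Incremental reads need both offset properties.
            if self.offset_column is None or self.initial_offset is None:
                raise ValueError(
                    "incremental reads need an offset column and an initial offset"
                )
        elif self.initial_offset is not None or self.offset_column is not None:
            # The offset properties are available only for incremental reads.
            raise ValueError("offset properties apply only to incremental reads")


# Sample values are placeholders chosen for illustration.
full_read = UnityCatalogOriginConfig("main", "sales", "orders")
full_read.validate()

incremental = UnityCatalogOriginConfig(
    "main", "sales", "orders",
    incremental_mode=True, initial_offset="0", offset_column="order_id",
)
incremental.validate()
```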