Unity Catalog

The Unity Catalog destination writes data to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.

The destination can write data to a new or existing managed table or external table. For an external table, the destination can write to any external system supported by Databricks Unity Catalog.

When you configure the destination, you specify the table type. When writing to an external table, you specify the table location and file type. You can also specify additional file options to use.

You define the catalog, schema, and table name to write to as well as the write mode to use. With some write modes, you can configure the destination to update or overwrite the existing schema, and to use partition columns.

Table Creation

The Unity Catalog destination can create a managed or external Unity Catalog table, as needed. If you configure the destination to write to a table that does not exist, the destination creates a table of that name in the specified location.

If you use the Overwrite Data write mode and specify partitions, the destination includes partitions when creating the table.

Partitioning

The Unity Catalog destination can partition data by columns. The partitioning differs depending on whether the table exists and whether you define partition columns:
New table
When the Unity Catalog destination writes to a new table and partition columns are not defined in stage properties, the destination uses the same number of partitions that Spark uses to process the upstream pipeline stages. The destination randomly redistributes the data to balance the data across the partitions, and then writes one output file for each partition to the specified table path. For example, if Spark splits the pipeline data into 20 partitions, the destination writes 20 output files to the specified table path.

When the destination writes to a new table and partition columns are defined in stage properties, the destination redistributes the data by the specified column, placing records with the same value for the specified column in the same partition. The destination creates a single file for each partition, writing each file to a subfolder within the table path.

You can define partition columns when you configure the destination to use the Overwrite Data write mode.
Existing table
When the Unity Catalog destination writes to an existing table and partition columns are not defined in stage properties, the destination automatically uses the same partitioning as the existing table.
When the destination writes to an existing table with partition columns defined in stage properties, the destination writes based on the specified partition columns.
You can define partition columns when you configure the destination to use the Overwrite Data write mode.
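The subfolder-per-value layout that partition columns produce can be sketched in plain Python (illustrative only, not Spark or the destination's implementation; the records, path, and column names are hypothetical):

```python
from collections import defaultdict

def partition_layout(records, table_path, partition_col):
    """Sketch of how partition columns shape the output: records with the
    same value for the partition column land in the same subfolder under
    the table path, one file per partition."""
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_col]].append(record)
    # Hive-style subfolder naming: <table path>/<column>=<value>
    return {
        f"{table_path}/{partition_col}={value}": rows
        for value, rows in groups.items()
    }
```

With three records where two share the same partition value, the layout yields two subfolders, and the two matching records are grouped in one of them.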

Write Mode

The write mode determines how the Unity Catalog destination writes to Databricks Unity Catalog. The destination includes the following write modes:
Overwrite data
The destination drops and recreates the table with each batch of data, using any specified partition columns. To avoid overwriting data unintentionally, use this write mode only with batch execution mode pipelines.
If the table does not exist, the Unity Catalog destination creates the table based on the data in the batch. If the table exists, the write behavior depends on two schema properties in the destination:
  • Merge Schema property - When enabled, the destination adds new columns to the existing schema, retaining the existing columns. This option is available for managed tables only.
  • Overwrite Schema property - When enabled, the destination replaces the existing schema with a new schema based entirely on the data in the batch.
For example, say you configure a Unity Catalog destination in a batch pipeline to write to an existing table with the Overwrite Data write mode and the Overwrite Schema property enabled. When the pipeline starts, the destination drops the table, creates a new table based on the data in the batch, and the pipeline then stops. In a streaming pipeline, before writing each subsequent batch, the destination would drop and recreate the table again based on the data in that batch.
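The drop-and-recreate behavior can be sketched as a plain-Python simulation (illustrative only, not the destination's implementation). With Overwrite Schema enabled, the schema is rebuilt from each batch, so only the last batch's rows and schema survive:

```python
def overwrite_data(batches):
    """Simulate the Overwrite Data write mode with Overwrite Schema
    enabled: each batch drops the table and recreates it from the
    batch alone."""
    table = None
    for batch in batches:
        # Schema is rebuilt entirely from the current batch's fields.
        schema = sorted({field for row in batch for field in row})
        table = {"schema": schema, "rows": list(batch)}
    return table
```

After two batches, the table holds only the second batch's rows and schema, which is why this mode suits batch pipelines rather than streaming pipelines.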
Append data
Appends data to existing data in the table.
When writing to a managed table, you can enable the Merge Schema property to add columns to the table as needed.
Error if exists
Generates an error that stops the pipeline if the table exists.
Ignore
Ignores data in the pipeline if the table exists, writing no data to the table.
Note: Though the data is not written, pipeline statistics indicate the number of rows that would have been written to the destination.
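The four write modes can be summarized in a plain-Python sketch (illustrative, not the destination's implementation; `existing_rows=None` stands for a table that does not yet exist):

```python
def apply_write_mode(existing_rows, batch, mode):
    """Sketch of the four write modes. Returns the table contents after
    the write, or raises if the mode forbids an existing table."""
    if mode == "overwrite":
        return list(batch)            # drop and recreate from the batch
    if existing_rows is None:
        return list(batch)            # table created for the other modes
    if mode == "append":
        return existing_rows + list(batch)
    if mode == "error_if_exists":
        raise ValueError("table already exists")
    if mode == "ignore":
        return existing_rows          # batch discarded, table untouched
    raise ValueError(f"unknown write mode: {mode}")
```

Note how Ignore leaves the existing rows untouched, matching the behavior described above where pipeline data is silently discarded.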

Configuring a Unity Catalog Destination

Configure a Unity Catalog destination to write to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
  2. On the Unity Catalog tab, configure the following properties:
    Unity Catalog Property Description
    Table Type Type of table to write to:
    • Managed
    • External
    Bucket and Table Path Bucket and path to the table to write to. This property is case-sensitive.

    Use the appropriate format. For example:

    For AWS: s3://<bucket-path>/<table-path>

    For Azure: abfss://<container>@<storageAccount>.dfs.core.windows.net/<path to folder>

    Available for external tables.

    File Format Format of the files to write:
    • Delta
    • CSV
    • JSON
    • Avro
    • Parquet
    • ORC
    • Text
    File Format Options Additional file format options to use.

    For more information about supported file formats, see the Databricks documentation.

    Catalog Name Catalog containing the table to write to. The catalog must exist before the pipeline runs.
    Schema Name Schema containing the table to write to. The schema must exist before the pipeline runs.
    Table Name Table to write to.
    Write Mode Write mode:
    • Overwrite Data - Drops and recreates the table with each batch before writing.
    • Append Data - Appends data to the table.
    • Error if Exists - Generates an error that stops the pipeline if the table exists.
    • Ignore - Ignores data in the pipeline if the table exists. The destination does not write pipeline data to the table.
    Partition Columns Columns to partition by.

    Available when overwriting data.

    Merge Schema Updates the existing schema with additional columns, as needed.

    Available when appending or overwriting data to a managed table.

    Overwrite Schema Creates a new schema based on the data for every batch.

    Available when overwriting data.
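The Catalog Name, Schema Name, and Table Name properties combine into Unity Catalog's three-level identifier. A small helper sketch (hypothetical, for illustration) shows the expected form, backtick-quoting any part that is not a plain identifier:

```python
def qualified_table_name(catalog, schema, table):
    """Build the three-level <catalog>.<schema>.<table> identifier,
    backtick-quoting any part that is not a plain identifier."""
    def quote(name):
        return name if name.isidentifier() else f"`{name}`"
    return ".".join(quote(part) for part in (catalog, schema, table))
```

For example, a catalog named main with schema sales and table orders resolves to main.sales.orders, while a catalog name containing a hyphen is backtick-quoted.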