Unity Catalog

The Unity Catalog destination writes data to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.

The destination can write data to a new or existing managed table or external table. For an external table, the destination can write to any external system supported by Databricks Unity Catalog.

When you configure the destination, you specify the table type. When writing to an external table, you specify the table location and file type. You can also specify additional file options to use.

You define the catalog, schema, and table name to write to as well as the write mode to use. With some write modes, you can configure the destination to update or overwrite the existing schema, and to use partition columns.

Table Creation

The Unity Catalog destination can create a managed or external Unity Catalog table, as needed. If you configure the destination to write to a table that does not exist, the destination creates a table of that name in the specified location.

If you use the Overwrite Data write mode and specify partitions, the destination includes partitions when creating the table.
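
In Spark terms, the first write to a nonexistent table is what creates it. The following sketch shows a rough PySpark equivalent; the catalog, schema, table, and column names are hypothetical, and the destination performs this write internally based on the stage properties.

  from pyspark.sql import SparkSession

  # Hypothetical names, for illustration only.
  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([(1, "east"), (2, "west")], ["id", "region"])

  # Writing to a Unity Catalog table that does not exist creates it.
  df.write.format("delta").mode("append").saveAsTable("main.sales.orders")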

Partitioning

The Unity Catalog destination can partition data by columns. The partitioning differs depending on whether the table exists and whether you define partition columns:
New table
When the Unity Catalog destination writes to a new table and partition columns are not defined in stage properties, the destination uses the same number of partitions that Spark uses to process the upstream pipeline stages. The destination randomly redistributes the data to balance the data across the partitions, and then writes one output file for each partition to the specified table path. For example, if Spark splits the pipeline data into 20 partitions, the destination writes 20 output files to the specified table path.

When the destination writes to a new table and partition columns are defined in stage properties, the destination redistributes the data by the specified column, placing records with the same value for the specified column in the same partition. The destination creates a single file for each partition, writing each file to a subfolder within the table path.

You can define partition columns when you configure the destination to use the Overwrite Data write mode.
Existing table
When the Unity Catalog destination writes to an existing table and partition columns are not defined in stage properties, the destination automatically uses the same partitioning as the existing table.
When the destination writes to an existing table with partition columns defined in stage properties, the destination writes based on the specified partition columns.
You can define partition columns when you configure the destination to use the Overwrite Data write mode.
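
To illustrate the two behaviors in Spark terms, the following sketch contrasts writing without and with partition columns. It is an approximation with hypothetical table and column names, not the destination's internal implementation.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(1, "east"), (2, "west"), (3, "east")], ["id", "region"])

  # No partition columns defined: keep the same number of partitions as the
  # upstream stages, redistribute rows to balance them, and write one output
  # file per partition.
  balanced = df.repartition(df.rdd.getNumPartitions())
  balanced.write.format("delta").mode("overwrite").saveAsTable("main.sales.orders")

  # Partition columns defined: rows with the same value of the region column
  # land in the same partition and are written to a subfolder per value,
  # such as region=east.
  (df.repartition("region").write.format("delta")
     .mode("overwrite")
     .partitionBy("region")
     .saveAsTable("main.sales.orders_by_region"))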

Write Mode

The write mode determines how the Unity Catalog destination writes to Databricks Unity Catalog. The destination includes the following write modes:
Overwrite data
The destination drops and recreates the table with each batch of data, using any specified partition columns. To avoid overwriting data unintentionally, use this write mode only with batch execution mode pipelines.
If the table does not exist, the Unity Catalog destination creates the table based on the data in the batch. How the destination handles the table schema depends on two schema properties in the destination:
  • Merge Schema property - When enabled, the destination adds new columns to the existing schema, retaining the existing columns. This option is available for managed tables only.
  • Overwrite Schema property - When enabled, the destination creates a new schema based entirely on the data in the batch.
For example, say you have a Unity Catalog destination in a batch pipeline that writes to an existing table, using the Overwrite Data write mode with the Overwrite Schema property enabled. When the pipeline starts, the destination drops the table, creates a new table based on the data in the batch, and then the pipeline stops. If it were a streaming pipeline, before writing each subsequent batch of data, the destination would drop and recreate the table again, based on that batch of data.
Append data
Appends data to existing data in the table.
When writing to a managed table, you can enable the Merge Schema property to add columns to the table as needed.
6.1 and later Upsert using merge
Merges new data with existing data in the table based on the specified merge condition and merge clauses.
For more information, see Upsert Using Merge.
Error if exists
Generates an error that stops the pipeline if the table exists.
Ignore
Ignores data in the pipeline if the table exists, writing no data to the table.
Note: Though the data is not written, pipeline statistics indicate the number of rows that would have been written to the destination.
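
In Spark DataFrame writer terms, these write modes map roughly to the save modes and Delta Lake schema options sketched below. The table names are hypothetical, and the Upsert Using Merge mode is covered separately in the next section.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([(1, "east")], ["id", "region"])
  table = "main.sales.orders"  # hypothetical table name

  # Overwrite Data, with Overwrite Schema enabled: replace the data and the schema.
  (df.write.format("delta").mode("overwrite")
     .option("overwriteSchema", "true").saveAsTable(table))

  # Append Data, with Merge Schema enabled: add rows, adding any new columns.
  (df.write.format("delta").mode("append")
     .option("mergeSchema", "true").saveAsTable(table))

  # Error if Exists: fails if the table already exists, otherwise creates it.
  df.write.format("delta").mode("errorifexists").saveAsTable("main.sales.orders_new")

  # Ignore: writes nothing if the table already exists.
  df.write.format("delta").mode("ignore").saveAsTable(table)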

Upsert Using Merge

6.1 and later The Upsert Using Merge write mode merges new data with existing data in the table based on the specified merge condition and merge clauses. This write mode is available with Transformer 6.1 and later.

You can use the Upsert Using Merge write mode to perform inserts, updates, and deletes to handle use cases like change data capture (CDC). You can also use the write mode to perform any subset of those operations, such as inserts only, or inserts and updates but no deletes.

When using the Upsert Using Merge write mode, you configure the following properties:
Merge Condition
The merge condition defines when the destination uses the specified merge configurations to write to the table.
When defining the merge condition, use the following guidelines:
  • To reference a column in the table, use original.<column name>.
  • To reference a matching field in pipeline data, use updates.<field name>.
For example, to perform the specified merge configurations when the id field in the data matches the id column in the destination table, set the merge condition to:
original.id = updates.id
Merge Clause
The merge clause specifies the actions the destination performs when a record meets the specified merge condition. You can specify multiple merge clauses for the destination to perform.
When defining a merge clause, you configure the following properties:
  • When Clause - Determines when the clause is applied: either when the record matches or does not match the specified merge condition.

    Use the When Matched option with the Update and Delete operations. Use the When Not Matched option with the Insert operation.

    For example, say the Merge Condition property is set to original.id = updates.id, the When Clause property is set to When Matched, and the Operation property is set to Update. When an ID in the record matches an ID in the table, the record is written to the table, updating the existing row.

    Important: The destination performs merge clauses in the specified order. When configuring multiple merge clauses of the same type, carefully consider the order that you use.

    Best practice is to list the merge clauses that affect the fewest records first, progressing to those that affect the most records. Typically, this results in listing deletes before updates.

  • Additional Condition - An optional condition to apply. Use to narrow the condition on which the merge occurs.

    For example, if the merge condition is original.id = updates.id and you set an additional condition of updates.delete=true, then the merge occurs only when the ID in the record matches an ID in the table and the delete field in the record is set to true.

  • Operation - The operation to perform: Insert, Update, or Delete. The operation that you can use depends on the When Clause option that you select.
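
Conceptually, the merge condition and merge clauses correspond to a Delta Lake merge operation. The following sketch uses the Delta Lake Python API with hypothetical table and field names, listing a delete clause before an update clause as suggested above; the destination builds the actual merge from the stage properties.

  from pyspark.sql import SparkSession
  from delta.tables import DeltaTable

  spark = SparkSession.builder.getOrCreate()
  updates = spark.createDataFrame(
      [(1, "Ana", False), (2, "Bo", True)], ["id", "name", "delete"])

  target = DeltaTable.forName(spark, "main.sales.customers")  # hypothetical table

  (target.alias("original")
     .merge(updates.alias("updates"), "original.id = updates.id")  # merge condition
     # When Matched with additional condition updates.delete = true: Delete
     .whenMatchedDelete(condition="updates.delete = true")
     # When Matched: Update
     .whenMatchedUpdateAll()
     # When Not Matched: Insert
     .whenNotMatchedInsertAll()
     .execute())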

Configuring a Unity Catalog Destination

Configure a Unity Catalog destination to write to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
  2. On the Unity Catalog tab, configure the following properties:
    Unity Catalog Property - Description
    Table Type - Type of table to write to:
    • Managed
    • External
    Bucket and Table Path - Bucket and path to the table to write to. This property is case-sensitive.

    Use the appropriate format. For example:
    • For AWS: s3://<bucket-path>/<table-path>
    • For Azure: abfss://<container>@<storageAccount>.dfs.core.windows.net/<path to folder>

    Available for external tables.

    File Format - Format of the files to write:
    • Delta
    • CSV
    • JSON
    • Avro
    • Parquet
    • Orc
    • Text
    File Format Options - Additional file format options to use.

    For more information about supported file formats, see the Databricks documentation.

    Catalog Name - Catalog containing the table to write to. The catalog must exist before the pipeline runs.
    Schema Name - Schema containing the table to write to. The schema must exist before the pipeline runs.
    Table Name - Table to write to.
    Write Mode - Write mode to use:
    • Overwrite Data - Drops and recreates the table with each batch of data before the write.
    • Append Data - Appends data to the table.
    • Error if Exists - Generates an error that stops the pipeline if the table exists.
    • 6.1 and later Upsert using Merge - Merges new data with existing data in the table based on the specified merge condition and merge clauses.

      You can use this write mode to perform inserts, updates, and deletes.

    • Ignore - Ignores data in the pipeline if the table exists. The destination does not write pipeline data to the table.
    Partition Columns - Columns to partition by.

    Available when overwriting data.

    Merge Schema - Updates the existing schema with additional columns, as needed.

    Available when appending data to or overwriting data in a managed table.

    Overwrite Schema - Creates a new schema based on the data for every batch.

    Available when overwriting data.

    6.1 and later Merge Condition - Condition on which the merge clauses are based.
    When defining the merge condition, use the following guidelines:
    • To reference a column in the table, use original.<column name>.
    • To reference a matching field in pipeline data, use updates.<field name>.

    Available when upserting data.

    6.1 and later Merge Clause - Actions performed when a record meets the specified merge condition. Configure the following properties for each merge clause, as needed.
    • When Clause - Determines when the clause is applied: either when the record matches or does not match the specified merge condition.
      Use the When Matched option with the Update and Delete operations. Use the When Not Matched option with the Insert operation.
      Important: The destination performs merge clauses in the specified order. When configuring multiple merge clauses of the same type, carefully consider the order that you use.
    • Additional Condition - An optional condition to apply. Use to narrow the condition on which the merge occurs.
    • Operation - The operation to perform: Insert, Update, or Delete. The operation that you can use depends on the When Clause option that you select.
    Click the Add icon to specify another merge clause.

    Available when upserting data.
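
As a point of reference, an external table configured with the CSV file format, an additional file format option, and a bucket and table path corresponds roughly to the following sketch. The option, path, and table names are hypothetical examples rather than values from your pipeline.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([(1, "east")], ["id", "region"])

  # External table: data files are stored at the specified bucket and table path.
  (df.write.format("csv")
     .option("header", "true")                         # example file format option
     .option("path", "s3://my-bucket/tables/orders")   # hypothetical bucket and table path
     .mode("append")
     .saveAsTable("main.sales.orders_ext"))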