Unity Catalog
The Unity Catalog destination writes data to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.
The destination can write data to a new or existing managed table or external table. For an external table, the destination can write to any external system supported by Databricks Unity Catalog.
When you configure the destination, you specify the table type. When writing to an external table, you specify the table location and file type. You can also specify additional file options to use.
You define the catalog, schema, and table name to write to as well as the write mode to use. With some write modes, you can configure the destination to update or overwrite the existing schema, and to use partition columns.
Table Creation
The Unity Catalog destination can create a managed or external Unity Catalog table, as needed. If you configure the destination to write to a table that does not exist, the destination creates a table of that name in the specified location.
If you use the Overwrite Data write mode and specify partitions, the destination includes partitions when creating the table.
Partitioning
- New table
- When the Unity Catalog destination writes to a new table and partition columns are not defined in stage properties, the destination uses the same number of partitions that Spark uses to process the upstream pipeline stages. The destination randomly redistributes the data to balance it across the partitions, and then writes one output file for each partition to the specified table path. For example, if Spark splits the pipeline data into 20 partitions, the destination writes 20 output files to the specified table path.
When the destination writes to a new table and partition columns are defined in stage properties, the destination redistributes the data by the specified column, placing records with the same value for the specified column in the same partition. The destination creates a single file for each partition, writing each file to a subfolder within the table path.
- Existing table
- When the Unity Catalog destination writes to an existing table and partition columns are not defined in stage properties, the destination automatically uses the same partitioning as the existing table.
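The redistribution by partition column described above can be sketched in plain Python. This is illustrative only: the destination itself relies on Spark, and `partition_records` is a hypothetical helper, not part of the product.

```python
from collections import defaultdict

def partition_records(records, partition_column):
    # Place records with the same value for the partition column in the
    # same group, one group per Hive-style subfolder (<column>=<value>)
    # under the table path. Each group corresponds to one output file.
    groups = defaultdict(list)
    for record in records:
        groups[record[partition_column]].append(record)
    return {f"{partition_column}={value}": recs for value, recs in groups.items()}
```

For example, partitioning three records by a `region` column with values `us`, `eu`, and `us` yields two groups, `region=us` (two records) and `region=eu` (one record), mirroring the one-subfolder-per-value layout described above.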
Write Mode
- Overwrite data
- The destination drops and recreates the table with each batch of data, using any specified partition columns. To avoid overwriting data unintentionally, use this write mode only with batch execution mode pipelines.
- Append data
- Appends data to existing data in the table.
- Error if exists
- Generates an error that stops the pipeline if the table exists.
- Ignore
- Ignores data in the pipeline if the table exists, writing no data to the table.
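The behavior of the four write modes can be summarized with a small in-memory simulation. This is a minimal sketch, assuming a dict stands in for the catalog; the real destination operates on Unity Catalog tables, and `apply_write_mode` is a hypothetical function, not a product API.

```python
def apply_write_mode(tables, name, batch, mode, partition_columns=None):
    # Illustrative sketch of the write modes against an in-memory dict of
    # tables. Mode names mirror the stage's write modes.
    exists = name in tables
    if mode == "overwrite":
        # Drop and recreate the table with each batch, keeping any
        # specified partition columns in the new table definition.
        tables[name] = {"rows": list(batch), "partitioned_by": partition_columns or []}
    elif mode == "append":
        if not exists:
            tables[name] = {"rows": [], "partitioned_by": partition_columns or []}
        tables[name]["rows"].extend(batch)
    elif mode == "error":
        if exists:
            raise RuntimeError(f"Table {name} already exists; stopping pipeline")
        tables[name] = {"rows": list(batch), "partitioned_by": []}
    elif mode == "ignore":
        if not exists:
            tables[name] = {"rows": list(batch), "partitioned_by": []}
        # If the table exists, write nothing.
```

Note how "overwrite" replaces the table contents on every batch, which is why that mode is recommended only for batch execution mode pipelines.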
Configuring a Unity Catalog Destination
Configure a Unity Catalog destination to write to a Databricks Unity Catalog table. Use the destination only in Databricks pipelines.
-
On the Properties panel, on the General tab, configure the following properties:
- Name
- Stage name.
- Description
- Optional description.
-
On the Unity Catalog tab, configure the following properties:
- Table Type
- Type of table to write to: Managed or External.
- Bucket and Table Path
- Bucket and path to the table to write to. This property is case-sensitive. Use the appropriate format. For example:
For AWS: s3://<bucket-path>/<table-path>
For Azure: abfss://<container>@<storageAccount>.dfs.core.windows.net/<path to folder>
Available for external tables.
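Assembling these case-sensitive paths can be error-prone, so the two formats can be captured in a small helper. This is a hypothetical convenience function for illustration, not part of the product, and the argument names are assumptions:

```python
def external_table_path(cloud, **kwargs):
    # Hypothetical helper: build the external table path in the format
    # shown above for each supported cloud provider.
    if cloud == "aws":
        return f"s3://{kwargs['bucket_path']}/{kwargs['table_path']}"
    if cloud == "azure":
        return (f"abfss://{kwargs['container']}@{kwargs['storage_account']}"
                f".dfs.core.windows.net/{kwargs['folder']}")
    raise ValueError(f"Unsupported cloud: {cloud}")
```

For example, `external_table_path("aws", bucket_path="my-bucket", table_path="tables/orders")` produces `s3://my-bucket/tables/orders`.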
- File Format
- Format of the files to write: Delta, CSV, JSON, Avro, Parquet, ORC, or Text.
- File Format Options
- Additional file format options to use. For more information about supported file formats, see the Databricks documentation.
- Catalog Name
- Catalog containing the table to write to. The catalog must exist before the pipeline runs.
- Schema Name
- Schema containing the table to write to. The schema must exist before the pipeline runs.
- Table Name
- Table to write to.
- Write Mode
- Write mode to use:
- Overwrite Data - Drops and recreates the table with each batch, before the write.
- Append Data - Appends data to the table.
- Error if Exists - Generates an error that stops the pipeline if the table exists.
- Ignore - Ignores data in the pipeline if the table exists. The destination does not write pipeline data to the table.
- Partition Columns
- Columns to partition by. Available when overwriting data.
- Merge Schema
- Updates the existing schema with additional columns, as needed. Available when appending data to or overwriting data in a managed table.
- Overwrite Schema
- Creates a new schema based on the data for every batch. Available when overwriting data.
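Several of these properties are only valid in combination with particular write modes. The constraints stated in the property descriptions can be checked with a small validation sketch. This is illustrative only, not part of the product; the property keys and mode strings are assumptions:

```python
def validate_stage_properties(props):
    # Check which optional properties apply for a given write mode,
    # per the property descriptions above. Returns a list of problems.
    errors = []
    mode = props.get("write_mode")
    if props.get("partition_columns") and mode != "overwrite":
        errors.append("Partition Columns is available only when overwriting data")
    if props.get("merge_schema") and (mode not in ("append", "overwrite")
                                      or props.get("table_type") != "managed"):
        errors.append("Merge Schema requires appending to or overwriting a managed table")
    if props.get("overwrite_schema") and mode != "overwrite":
        errors.append("Overwrite Schema is available only when overwriting data")
    return errors
```

For example, enabling Merge Schema on an external table, or specifying partition columns with the Append Data write mode, would each produce a validation error.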