Google Big Query
The Google Big Query destination writes data to a Google BigQuery table. Use the destination in Databricks or Dataproc cluster pipelines only. To use the destination in Databricks clusters, you must configure specific Spark properties.
The Google Big Query destination stages data in a Google Cloud Storage bucket before writing it to BigQuery. The destination can write data to a new or existing BigQuery table. You can configure the destination to create the table if the specified table does not exist, and to truncate the table before writing each batch.
The Google Big Query destination writes to a BigQuery table based on the specified write mode. The destination can insert data into the table or merge data with existing data in the table. When inserting data, the destination can add new nullable columns to the table schema and can relax required fields to allow null values. When merging data, the destination can insert, update, and delete records based on one or more key columns and the specified merge conditions.
When you configure the Google Big Query destination, you specify the dataset, table, and temporary storage bucket. You specify whether to create a new table or to truncate an existing table. You also select the write mode to use. When merging data, you specify the join key, merge conditions, and operations.
Before you configure the Google Big Query destination, complete the prerequisite task.
Prerequisite
If necessary, create a bucket in Google Cloud Storage before you configure the Google Big Query destination. The destination temporarily stores data in the specified bucket before writing the data to BigQuery.
To ensure data integrity, use a separate bucket for every pipeline. Do not use the bucket for any other pipelines or processes.
Write Mode
- Insert - The destination inserts all data into the table.
- Merge - The destination merges data with existing data in the table. The destination performs inserts, updates, and deletes based on the specified merge properties.
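As a sketch of the difference between the two write modes, the pure-Python snippet below mirrors their semantics on in-memory rows. This is illustrative only; the destination itself issues BigQuery jobs, and the helper names here are hypothetical, not part of the product.

```python
# Illustrative sketch only: pure Python mirroring the semantics of the two
# write modes. The destination itself issues BigQuery load and merge jobs;
# these helper names are hypothetical, not part of the product.

def write_insert(table_rows, batch):
    """Insert mode: every batch record is appended to the table."""
    return table_rows + batch

def write_merge(table_rows, batch, join_key):
    """Merge mode: a batch record updates the row with a matching join key,
    or is inserted when no match exists."""
    by_key = {row[join_key]: dict(row) for row in table_rows}
    for record in batch:
        by_key[record[join_key]] = dict(record)  # upsert on the join key
    return list(by_key.values())

existing = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
batch = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]

inserted = write_insert(existing, batch)     # 4 rows; id 2 appears twice
merged = write_merge(existing, batch, "id")  # 3 rows; id 2 updated to "B"
```

Note how Insert mode can produce duplicate rows for the same key, which is why Merge mode requires join keys to keep rows unique.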
Merge Properties
- Join Keys - One or more key columns in the table. Used to perform updates and deletes and to ensure that duplicate rows do not exist for inserts. Pipeline records must include a matching field name.
- Merge Configuration - Action that the destination performs when a record meets the specified conditions. You can specify multiple merge configurations for the destination to perform. Important: The destination performs the writes in the specified order. Best practice is to list merge configurations with the smallest number of affected records first, progressing to the largest number of affected records. When defining multiple merge configurations of the same type, carefully consider the order that you use.
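To make the clause ordering concrete, the hypothetical sketch below renders a list of merge configurations as the kind of BigQuery MERGE statement they correspond to. The destination's actual generated SQL may differ; `build_merge_sql` and its parameters are illustrative only.

```python
# Hypothetical sketch: renders merge configurations as the kind of BigQuery
# MERGE statement they correspond to. The destination's actual SQL may differ;
# build_merge_sql and its parameters are illustrative, not part of the product.

def build_merge_sql(target, source, join_keys, clauses, update_cols):
    """Emit one WHEN ... THEN ... clause per merge configuration, in order."""
    on = " AND ".join(f"T.{k} = S.{k}" for k in join_keys)
    set_expr = ", ".join(f"T.{c} = S.{c}" for c in update_cols)
    actions = {
        "Insert": "INSERT ROW",
        "Update": f"UPDATE SET {set_expr}",
        "Delete": "DELETE",
    }
    lines = [f"MERGE `{target}` T USING `{source}` S ON {on}"]
    # Clause order matters: the destination applies clauses in the listed order.
    lines += [f"{when} THEN {actions[op]}" for when, op in clauses]
    return "\n".join(lines)

sql = build_merge_sql(
    "ds.orders", "ds.staged_batch", ["id"],
    [("WHEN MATCHED", "Update"),
     ("WHEN NOT MATCHED", "Insert"),
     ("WHEN NOT MATCHED BY SOURCE", "Delete")],
    ["status"],
)
print(sql)
```

Because clauses apply in order, swapping two clauses of the same type can change which rows each one affects, which is why the ordering guidance above matters.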
Configuring a Google Big Query Destination
Use the Google Big Query destination to write to a BigQuery table. Use the destination in Databricks or Dataproc cluster pipelines only. To use the destination in Databricks clusters, you must configure specific Spark properties.
Before you configure the destination, complete the prerequisite task.
1. On the Properties panel, on the General tab, configure the following properties:
   - Name - Stage name.
   - Description - Optional description.
   - Stage Library - Stage library to use to connect to Google Cloud:
     - Google Cloud cluster-provided libraries - The cluster where the pipeline runs has Google Cloud libraries installed, and therefore has all of the necessary libraries to run the pipeline.
     - Google Cloud Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline. Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Google Cloud libraries for Hadoop. Select the appropriate version for your cluster.
     Note: When using additional Google Cloud or Google Big Query stages in the pipeline, ensure that they use the same stage library.
2. On the Big Query tab, configure the following properties:
   - Dataset - Google BigQuery dataset to write to.
   - Table - Google BigQuery table to write to. Note: The name of the table or view can include ASCII alphanumeric characters and underscores (_). The name must not contain other special characters or spaces.
   - Temporary Google Storage Bucket - Google Cloud Storage bucket to use for temporary storage of data. The bucket must exist before you start the pipeline.
   - Write Mode - Mode to write objects:
     - Insert - The destination inserts all data into the table.
     - Merge - The destination merges data with existing data in the table.
   - Join Keys - One or more key columns in the table. Used to perform updates and deletes and to ensure that duplicate rows do not exist for inserts. Pipeline records must include a matching field name. The property is case sensitive.
     Available with the Merge write mode.
   - Merge Configuration - Action that the destination performs when a record meets the specified conditions. You can specify multiple merge configurations for the destination to perform. Important: The destination performs the writes in the specified order. Best practice is to list merge configurations with the smallest number of affected records first, progressing to the largest number of affected records. When defining multiple merge configurations of the same type, carefully consider the order that you use.
     Configure the following properties:
     - When Clause - Determines when the clause is applied: when the record matches or does not match the specified merge condition.
       - When Matched - Use with the Update and Delete operations. Updates or deletes a row upon finding a matching join key.
       - When Not Matched - Use with the Insert operation. Inserts a row when no matching join key exists in the table.
       - When Not Matched by Source - Use with the Delete operation. Deletes all rows in the table that do not have a matching join key in the batch of data being written. Use this option with care.
       Though supported by BigQuery, the destination does not support using the When Not Matched by Source option with the Update operation. For more information about the When Not Matched by Source option, see the Google BigQuery documentation.
     - Additional Condition - An optional additional condition for the merge. Use $input. to prefix field names in records, and $target. to prefix column names in the table.
     - Operation - The operation to perform: Insert, Update, or Delete. The operations available depend on the When Clause option that you select.
     Click the Add icon to specify another merge clause.
     Available with the Merge write mode.
   - Create Table - Creates a table when the pipeline starts if the specified table does not exist. Creates the table based on the first batch of data processed. Can add nullable columns to the table if subsequent batches include additional fields.
   - Truncate Table - Truncates the table before writing each batch of data.
   - Allow Field Addition - Allows adding new nullable fields to the BigQuery table when data includes additional fields. Available with the Insert write mode.
   - Allow Field Relaxation - Allows relaxing required fields to allow null values. Available with the Insert write mode.
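As a sketch of what Allow Field Addition and Allow Field Relaxation describe, the pure-Python snippet below adjusts a simple column-to-mode schema map for an incoming batch. This is illustrative only, not the destination's own logic; `evolve_schema` is a hypothetical helper.

```python
# Illustrative sketch of the schema adjustments that Allow Field Addition and
# Allow Field Relaxation describe. Pure Python; evolve_schema is a hypothetical
# helper, not the destination's own logic.

def evolve_schema(schema, batch_fields, allow_addition, allow_relaxation):
    """schema maps column name -> "REQUIRED" or "NULLABLE".
    Returns the schema adjusted for the fields present in the batch."""
    evolved = dict(schema)
    if allow_addition:
        for field in batch_fields:
            if field not in evolved:
                evolved[field] = "NULLABLE"  # new columns are added as nullable
    if allow_relaxation:
        for column in evolved:
            if column not in batch_fields and evolved[column] == "REQUIRED":
                evolved[column] = "NULLABLE"  # relax columns missing from data
    return evolved

schema = {"id": "REQUIRED", "name": "REQUIRED"}
batch_fields = {"id", "email"}  # batch adds "email" and omits "name"
evolved = evolve_schema(schema, batch_fields,
                        allow_addition=True, allow_relaxation=True)
print(evolved)
```

With both options enabled, the new `email` column is added as nullable and the missing required `name` column is relaxed to allow nulls, while `id`, which is present in the batch, stays required.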