Google Big Query

The Google Big Query destination writes data to a Google BigQuery table. Use the destination in Databricks or Dataproc cluster pipelines only. To use the destination in Databricks clusters, you must configure specific Spark properties.

The Google Big Query destination stages data in a Google Cloud Storage bucket before writing it to BigQuery. The destination can write to a new or existing BigQuery table. You can configure the destination to create the table if it does not exist, and to truncate the table before writing each batch.

The Google Big Query destination writes to a BigQuery table based on the specified write mode. The destination can insert data into the table or merge data with existing data in the table. When inserting data, the destination can add new nullable columns to the table schema and can relax required fields to allow null values. When merging data, the destination can insert, update, and delete records based on one or more key columns and the specified merge conditions.

When you configure the Google Big Query destination, you specify the dataset, table, and temporary storage bucket. You specify whether to create a new table or to truncate an existing table. You also select the write mode to use. When merging data, you specify the join key, merge conditions, and operations.

Before you configure the Google Big Query destination, complete the prerequisite task.

Prerequisite

If necessary, create a bucket in Google Cloud Storage before you configure the Google Big Query destination. The destination temporarily stores data in the specified bucket before writing the data to BigQuery.

To ensure data integrity, use a separate bucket for every pipeline. Do not use the bucket for any other pipelines or processes.

Write Mode

You can use one of the following write modes with the Google Big Query destination:
Insert
The destination inserts all data into the table.
With this write mode, you can configure the destination to add new nullable fields to the BigQuery table when data includes additional fields, and to relax required fields to allow null values.
Merge
The destination merges data with existing data in the table. The destination performs inserts, updates, and deletes based on the specified merge properties.
With this write mode, you specify one or more join keys to uniquely identify records.
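
The two write modes can be sketched with a small in-memory model. This is an illustration of the behavior described above, not the destination's implementation; table rows and pipeline records are modeled as plain dicts:

```python
# Hypothetical in-memory model of the two write modes; an illustration of
# the documented behavior, not the destination's implementation.

def write_insert(table, records):
    """Insert mode: every record is appended to the table."""
    return table + records

def write_merge(table, records, join_key):
    """Merge mode: a record that matches an existing row on the join key
    replaces that row; non-matching records are inserted."""
    merged = {row[join_key]: row for row in table}
    for rec in records:
        merged[rec[join_key]] = rec    # update on match, insert otherwise
    return list(merged.values())

table = [{"id": 1, "name": "old"}]
records = [{"id": 1, "name": "new"}, {"id": 2, "name": "added"}]

print(write_insert(table, records))        # id 1 now appears twice
print(write_merge(table, records, "id"))   # id 1 updated, id 2 inserted
```

Note that Insert mode keeps both copies of id 1, which is why Merge mode requires join keys that uniquely identify records.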

Merge Properties

When using the Merge write mode, you configure the following properties:
Join Keys
One or more key columns in the table. Used to perform updates and deletes and to ensure that duplicate rows do not exist for inserts. Pipeline records must include a matching field name.
The property is case sensitive.
Merge Configuration
Action that the destination performs when a record meets the specified conditions. You can specify multiple merge configurations for the destination to perform.
Important: The destination performs the writes in the specified order. Best practice is to list merge configurations with the smallest number of affected records first, progressing to the largest number of affected records. When defining multiple merge configurations of the same type, carefully consider the order that you use.
When defining a merge configuration, you define the following properties:
  • When Clause - Determines when an action is performed. Select one of the following options:
    • When Matched - Use with the Update and Delete operations. Updates or deletes a row upon finding a matching join key.
    • When Not Matched - Use with the Insert operation. Inserts a row when no matching join key exists in the table.
    • When Not Matched by Source - Use with the Delete operation. Deletes all rows in the table that do not have a matching join key in the batch of data being written. Use this option with care.

      Though supported by BigQuery, the destination does not support using the When Not Matched by Source option with the Update operation. For more about the When Not Matched by Source option, see the Google BigQuery documentation.

    For example, say the Join Key is set to id, the When Clause property is set to When Matched, and the Operation property is set to Update. When an ID in a record matches an ID in the table, the record is written to the table, updating the existing row.

  • Additional Condition - An optional additional condition for the merge. Use $input. to prefix field names in records, and $target. to prefix column names in the table.

    For example, if the Join Key is id and you set an additional condition of $input.delete=true, the merge occurs only when the ID in the record matches an ID in the table and the delete field in the record is set to true.

  • Operation - The operation to perform: Insert, Update, or Delete. The operations that you can use depend on the When Clause option that you select.

    For example, the Insert operation can be used only when the When Clause is set to When Not Matched.
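
The ordered evaluation of merge configurations can be sketched as follows. This is a hypothetical illustration of the semantics described above, not the destination's implementation; conditions are modeled as Python predicates taking (record, target_row) in place of $input./$target. expressions:

```python
# Hypothetical sketch of ordered merge-configuration evaluation; an
# illustration only, not the destination's implementation.

def apply_merge(table, records, join_key, configs):
    by_key = {row[join_key]: row for row in table}
    source_keys = {rec[join_key] for rec in records}
    consumed = set()                    # records that already fired a clause
    for when, condition, operation in configs:   # evaluated in listed order
        if when == "not matched by source":
            # Operates on table rows with no matching record in the batch.
            by_key = {k: v for k, v in by_key.items()
                      if k in source_keys or not condition(None, v)}
            continue
        for i, rec in enumerate(records):
            if i in consumed:           # each record fires at most one clause
                continue
            key = rec[join_key]
            if when == "matched" and key in by_key and condition(rec, by_key[key]):
                if operation == "update":
                    by_key[key] = rec
                elif operation == "delete":
                    del by_key[key]
                consumed.add(i)
            elif when == "not matched" and key not in by_key and condition(rec, None):
                if operation == "insert":
                    by_key[key] = rec
                consumed.add(i)
    return sorted(by_key.values(), key=lambda r: r[join_key])

always = lambda rec, target: True
configs = [
    # Smallest number of affected records first, per the best practice above.
    # The first clause mirrors the $input.delete=true example.
    ("matched", lambda rec, t: rec.get("delete") is True, "delete"),
    ("matched", always, "update"),
    ("not matched", always, "insert"),
    ("not matched by source", always, "delete"),
]
table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 9, "v": "stale"}]
records = [{"id": 1, "v": "a2"}, {"id": 2, "delete": True}, {"id": 3, "v": "c"}]
print(apply_merge(table, records, "id", configs))
# → [{'id': 1, 'v': 'a2'}, {'id': 3, 'v': 'c'}]
```

The consumed set matters: once the delete clause fires for the flagged id 2 record, that record cannot fall through to the insert clause and re-create the row, which is why clause order deserves the care the note above calls for.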

Configuring a Google Big Query Destination

Use the Google Big Query destination to write to a BigQuery table. Use the destination in Databricks or Dataproc cluster pipelines only. To use the destination in Databricks clusters, you must configure specific Spark properties.

Before you configure the destination, complete the prerequisite task.

  1. On the Properties panel, on the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
    • Stage Library - Stage library to use to connect to Google Cloud:
      • Google Cloud cluster-provided libraries - The cluster where the pipeline runs has Google Cloud libraries installed, and therefore has all of the necessary libraries to run the pipeline.
      • Google Cloud Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

        Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Google Cloud libraries for Hadoop.

        Select the appropriate version for your cluster.

    Note: When using additional Google Cloud or Google Big Query stages in the pipeline, ensure that they use the same stage library.
  2. On the Big Query tab, configure the following properties:
    • Dataset - Google BigQuery dataset to write to.
    • Table - Google BigQuery table to write to.

      Note: The table name can include ASCII alphanumeric characters and underscores (_). The name must not contain other special characters or spaces.

    • Temporary Google Storage Bucket - Google Cloud Storage bucket to use for temporary storage of data. The bucket must exist before you start the pipeline.
    • Write Mode - Mode to write objects:
      • Insert - The destination inserts all data into the table.
      • Merge - The destination merges data with existing data in the table.
    • Join Keys - One or more key columns in the table. Used to perform updates and deletes and to ensure that duplicate rows do not exist for inserts. Pipeline records must include a matching field name.

      The property is case sensitive.

      Available with the Merge write mode.

    • Merge Configuration - Action that the destination performs when a record meets the specified conditions. You can specify multiple merge configurations for the destination to perform.

      Important: The destination performs the writes in the specified order. Best practice is to list merge configurations with the smallest number of affected records first, progressing to the largest number of affected records. When defining multiple merge configurations of the same type, carefully consider the order that you use.

      Configure the following properties:
      • When Clause - Determines when the clause is applied: either when the record matches or does not match the specified merge condition.
        • When Matched - Use with the Update and Delete operations. Updates or deletes a row upon finding a matching join key.
        • When Not Matched - Use with the Insert operation. Inserts a row when no matching join key exists in the table.
        • When Not Matched by Source - Use with the Delete operation. Deletes all rows in the table that do not have a matching join key in the batch of data being written. Use this option with care.

          Though supported by BigQuery, the destination does not support using the When Not Matched by Source option with the Update operation. For more about the When Not Matched by Source option, see the Google BigQuery documentation.

      • Additional Condition - An optional additional condition for the merge. Use $input. to prefix field names in records, and $target. to prefix column names in the table.
      • Operation - The operation to perform: Insert, Update, or Delete. The operations that you can use depend on the When Clause option that you select.

      Click the Add icon to specify another merge clause.

      Available with the Merge write mode.

    • Create Table - Creates a table when the pipeline starts if the specified table does not exist. The table is created based on the first batch of data processed.

      Can add nullable columns to the table if subsequent batches include additional fields.

    • Truncate Table - Truncates the table before writing each batch of data.
    • Allow Field Addition - Allows adding new nullable fields to the BigQuery table when data includes additional fields.

      Available with the Insert write mode.

    • Allow Field Relaxation - Allows relaxing required fields to allow null values.

      Available with the Insert write mode.
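
The Allow Field Addition and Allow Field Relaxation options amount to simple schema evolution. The following hypothetical sketch models a table schema as a column-to-mode mapping using the BigQuery column modes REQUIRED and NULLABLE; treating a field that is missing from a record as the trigger for relaxation is an assumption made for illustration:

```python
# Hypothetical sketch of the Insert-mode schema options; an illustration
# only, not how BigQuery applies schema updates internally.

def evolve_schema(schema, record_fields, allow_addition, allow_relaxation):
    """Return the table schema after processing a record's field list."""
    new_schema = dict(schema)
    if allow_addition:
        for field in record_fields:
            if field not in new_schema:
                new_schema[field] = "NULLABLE"   # new columns are nullable
    if allow_relaxation:
        for column, mode in new_schema.items():
            if mode == "REQUIRED" and column not in record_fields:
                new_schema[column] = "NULLABLE"  # missing values become null
    return new_schema

schema = {"id": "REQUIRED", "name": "REQUIRED"}
# A record that adds an "email" field and omits the required "name" field:
print(evolve_schema(schema, ["id", "email"], True, True))
# → {'id': 'REQUIRED', 'name': 'NULLABLE', 'email': 'NULLABLE'}
```

With both options disabled, the same record would instead fail to load, since the table schema could not accept the extra field or the missing required value.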