Google Big Query
The Google Big Query origin reads data from a Google BigQuery table or view. Use the origin in Databricks or Dataproc cluster pipelines only. To use the origin in Databricks clusters, you must configure specific Spark properties.
When you configure the origin, you specify the dataset and the name of the table or view. The origin reads the entire table or view by default. You can configure the origin to process only the specified columns. You can also limit the query by defining a filter condition to include in a WHERE clause.
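Under the hood, Transformer pipelines run on Apache Spark. As a rough illustration only, not the origin's actual implementation, reading a table with column projection and a pushed-down filter through the open-source spark-bigquery-connector looks something like the following sketch; the table and column names are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bigquery-read-sketch").getOrCreate()

// Hypothetical table and columns. The filter option is pushed down to
// BigQuery as a WHERE clause, and select() limits the columns read.
val df = spark.read
  .format("bigquery")
  .option("table", "my_dataset.my_table")
  .option("filter", "id > 1000")
  .load()
  .select("id", "name")

df.show()
```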
You indicate whether the origin runs in incremental mode or full query mode. When running in incremental mode, you define the offset column and initial offset.
You can specify the number of workers that the origin uses to read from BigQuery.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
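Conceptually, caching a batch resembles persisting a Spark DataFrame so downstream stages reuse it instead of recomputing it. A minimal sketch, assuming a hypothetical stand-in DataFrame rather than an actual BigQuery read:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()

// Hypothetical stand-in for the origin's output.
val batch = spark.range(0, 1000).toDF("id")

// Rough analogue of the Cache Data option: persist the batch so that
// multiple downstream stages reuse it instead of recomputing the read.
val cached = batch.cache()

cached.count()                      // first action materializes the cache
cached.filter("id > 500").count()   // later stages reuse the cached data
```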
Incremental and Full Query Mode
The Google Big Query origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.
When the origin runs in full query mode, it processes all available data with each batch.
When the origin runs in incremental mode, the origin reads all available data with the first batch of data, starting from the specified initial offset. For subsequent batches, the origin reads data that arrived since the prior batch. If you stop the pipeline, the origin starts processing data from the last-saved offset when you restart the pipeline, unless you reset the pipeline offsets.
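Conceptually, incremental mode amounts to filtering each read by the last-saved offset and then saving the highest offset just read. The sketch below illustrates that idea with the open-source spark-bigquery-connector; the table name, offset column, and offset-tracking variable are placeholders, not the origin's actual offset mechanism:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("incremental-sketch").getOrCreate()

// Hypothetical names: my_dataset.events is the table, id is the offset
// column, and the starting value mirrors the Initial Offset property.
var lastOffset = 0L

// Each batch reads only rows beyond the last-saved offset.
val batch = spark.read
  .format("bigquery")
  .option("table", "my_dataset.events")
  .option("filter", s"id > $lastOffset")
  .load()

// Save the highest offset read so the next batch resumes from there.
if (!batch.isEmpty) {
  lastOffset = batch.agg(max("id")).first().getLong(0)
}
```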
Offset Column and Supported Types
When you run the Google Big Query origin in incremental mode, you specify an offset column and initial offset.
The offset column should contain unique, incremental values and should not contain null values. The origin does not process records with null offset values. You can use columns of the following BigQuery data types as the offset column:
- Date, Timestamp
- Any supported numeric type, including Numeric and Int64
- String
BigQuery Data Types
The following table lists the BigQuery data types that the Google Big Query origin supports and the Spark data types that Transformer converts them to.
| BigQuery Data Type | Spark Data Type |
|---|---|
| Array | Array |
| Bool | Boolean |
| Bytes | Binary |
| Date | Date |
| Datetime | String |
| Float64 | Double |
| Int64 | Long |
| Numeric | Decimal |
| String | String |
| Struct | Struct |
| Time | Long, microseconds since midnight |
| Timestamp | Timestamp |
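For example, because BigQuery Time values arrive in Spark as Long microseconds since midnight, downstream logic may need to convert them back into a readable form. A minimal sketch, assuming a hypothetical event_time column holding such values:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_string}

val spark = SparkSession.builder().appName("time-sketch").getOrCreate()
import spark.implicits._

// Hypothetical values: BigQuery Time fields arrive as Long microseconds
// since midnight, here representing 09:30:00 and 17:00:00.
val df = Seq(34200000000L, 61200000000L).toDF("event_time")

// Convert microseconds since midnight into an HH:mm:ss string.
val readable = df.withColumn(
  "event_time_str",
  format_string(
    "%02d:%02d:%02d",
    (col("event_time") / 3600000000L).cast("int"),
    (col("event_time") % 3600000000L / 60000000L).cast("int"),
    (col("event_time") % 60000000L / 1000000L).cast("int")
  )
)
readable.show()
```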
Configuring a Big Query Origin
Use the Google Big Query origin to read from a BigQuery table or view. Use the origin in Databricks or Dataproc cluster pipelines only. To use the origin in Databricks clusters, you must configure specific Spark properties.
- On the Properties panel, on the General tab, configure the following properties:
| General Property | Description |
|---|---|
| Name | Stage name. |
| Description | Optional description. |
| Load Data Only Once | Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run. Select this property for lookup origins. When configuring lookup origins, do not limit the batch size; all lookup data should be read in a single batch. |
| Cache Data | Caches data processed for a batch so the data can be reused by multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode. Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run. |
| Skip Offset Tracking | Skips tracking offsets. The origin reads all available data for each batch that the pipeline processes, limited by any batch-size configuration for the origin. |
- On the Big Query tab, configure the following properties:
| Big Query Property | Description |
|---|---|
| Dataset | Google BigQuery dataset. |
| Table | Google BigQuery table or view to read. Note: The name of the table or view can include ASCII alphanumeric characters and underscores (_). The name must not contain other special characters or spaces. |
| Columns to Read | Columns in the BigQuery table or view to read. By default, the origin reads all columns. Specify columns to reduce the number of columns included in the read. |
| Filter Condition | Condition to limit the query that the origin performs. Filter conditions are pushed down to BigQuery, except conditions on nested fields. You can specify any condition that is valid in the WHERE clause of a BigQuery SQL query. Do not include WHERE in the predicate. For example, to query rows with IDs greater than 1000 in an id column, enter id > 1000. |
| Incremental Mode | Enables the origin to run in incremental mode. When not enabled, the origin runs in full query mode. |
| Offset Column | Column used to track the progress of the read. The offset column must be of a supported type and should contain unique, incremental values with no null values; the origin does not process records with null offset values. Because the underlying query uses an ORDER BY clause and inequality operators on this column, an index on the column is strongly encouraged. Required in incremental mode. |
| Initial Offset | Initial offset value to read from. Used when the pipeline starts for the first time, and again after you reset the pipeline offsets. Required in incremental mode. |
| Max Readers | Maximum number of readers to use, which determines the number of partitions used during processing. The number of readers actually used depends on BigQuery limits and the volume of data in the table or view. Default is 0, which allows one reader for each 400 MB of data. |
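Max Readers governs read parallelism. In the open-source spark-bigquery-connector, the closest analogue is the maxParallelism option; the sketch below illustrates that mapping as an assumption, since the origin manages readers internally, and uses a placeholder table name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("readers-sketch").getOrCreate()

// Hypothetical table; maxParallelism caps the number of read streams,
// similar in spirit to the Max Readers property.
val df = spark.read
  .format("bigquery")
  .option("table", "my_dataset.events")
  .option("maxParallelism", "8")
  .load()

// The partition count reflects the readers actually granted; BigQuery
// may grant fewer based on table size and service limits.
println(df.rdd.getNumPartitions)
```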