Google Big Query
The Google Big Query origin reads data from a Google BigQuery table or view. Use the origin in Databricks or Dataproc cluster pipelines only. To use the origin in Databricks clusters, you must configure specific Spark properties.
When you configure the origin, you specify the dataset and the name of the table or view. By default, the origin reads the entire table or view. You can configure the origin to read only specified columns, and you can limit the query by defining a filter condition to include in a WHERE clause.
You also indicate whether the origin runs in incremental mode or full query mode. When running in incremental mode, you define the offset column and initial offset.
You can specify the maximum number of readers that the origin uses to read from BigQuery.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
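In Transformer you configure all of this in the pipeline UI, but as a rough mental model, the read amounts to something like the following PySpark sketch using the open source spark-bigquery connector. This is a conceptual illustration, not Transformer's implementation; the dataset, table, column names, and filter are assumptions.

```python
# Conceptual sketch only, not Transformer's implementation. Assumes the
# spark-bigquery connector is available on the cluster; names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-origin-sketch").getOrCreate()

df = (
    spark.read.format("bigquery")
    .option("table", "my_dataset.my_table")   # dataset and table/view to read
    .option("filter", "id > 1000")            # WHERE-clause condition, pushed down
    .load()
    .select("id", "name", "updated_at")       # read only specified columns
)
df.show(5)
```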
Incremental and Full Query Mode
The Google Big Query origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.
When the origin runs in full query mode, it processes all available data with each batch.
When the origin runs in incremental mode, the origin reads all available data with the first batch of data, starting from the specified initial offset. For subsequent batches, the origin reads data that arrived since the prior batch. If you stop the pipeline, the origin starts processing data from the last-saved offset when you restart the pipeline, unless you reset the pipeline offsets.
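As a rough illustration of incremental mode, the hedged Python sketch below reads only rows beyond the last saved offset, then advances the offset to the largest value seen. The offset bookkeeping, table, and column names are illustrative assumptions, not Transformer's actual implementation.

```python
# Illustrative sketch of incremental-mode reading, not Transformer source code.
# Assumes the spark-bigquery connector; table and column names are made up.
from pyspark.sql import SparkSession, functions as F

OFFSET_COLUMN = "id"   # the configured offset column (illustrative)

def read_batch(spark, last_offset):
    """Read rows that arrived after last_offset and return the new offset."""
    df = (
        spark.read.format("bigquery")
        .option("table", "my_dataset.my_table")
        .option("filter", f"{OFFSET_COLUMN} > {last_offset}")  # only new rows
        .load()
    )
    # Advance the offset to the largest value seen in this batch.
    max_seen = df.agg(F.max(OFFSET_COLUMN)).first()[0]
    return df, (max_seen if max_seen is not None else last_offset)

spark = SparkSession.builder.getOrCreate()
batch1, offset = read_batch(spark, last_offset=0)   # initial offset
batch2, offset = read_batch(spark, offset)          # reads only newer rows
```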
Offset Column and Supported Types
When you run the Google Big Query origin in incremental mode, you specify an offset column and initial offset.
The offset column should contain unique, incremental values and should not contain null values; the origin does not process records with null offset values. The offset column must be one of the following supported types:
- Date, Timestamp
- Any supported numeric type, including Numeric and Int64
- String
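Before enabling incremental mode, it can help to verify that a candidate offset column meets these requirements. Below is a minimal sketch using the google-cloud-bigquery client library; the project, dataset, table, and column names are assumptions.

```python
# Optional pre-flight check, assuming the google-cloud-bigquery client library
# and default credentials; project, dataset, table, and column are illustrative.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      COUNTIF(id IS NULL) AS null_offsets,          -- rows the origin would skip
      COUNT(id) - COUNT(DISTINCT id) AS dup_offsets -- non-unique offset values
    FROM `my_project.my_dataset.my_table`
"""
row = next(iter(client.query(sql).result()))
print(f"null offsets: {row.null_offsets}, duplicate offsets: {row.dup_offsets}")
```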
BigQuery Data Types
The following table lists the BigQuery data types that the Google Big Query origin supports and the Spark data types that Transformer converts them to.
| BigQuery Data Type | Spark Data Type |
| --- | --- |
| Array | Array |
| Bool | Boolean |
| Bytes | Binary |
| Date | Date |
| Datetime | String |
| Float64 | Double |
| Int64 | Long |
| Numeric | Decimal |
| String | String |
| Struct | Struct |
| Time | Long (microseconds since midnight) |
| Timestamp | Timestamp |
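One quick way to confirm these conversions for a specific table is to load it and print the resulting Spark schema. A minimal sketch, assuming the spark-bigquery connector is available and using an illustrative table name:

```python
# Minimal sketch: load a table through the spark-bigquery connector and print
# the Spark schema it produces. Table name is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("bigquery").option("table", "my_dataset.my_table").load()
df.printSchema()  # e.g., a BigQuery Datetime column appears as string,
                  # and a Time column as long, per the table above
```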
Configuring a Big Query Origin
Use the Google Big Query origin to read from a BigQuery table or view. Use the origin in Databricks or Dataproc cluster pipelines only. To use the origin in Databricks clusters, you must configure specific Spark properties.
- On the Properties panel, on the General tab, configure the following properties:
  - Name - Stage name.
  - Description - Optional description.
  - Stage Library - Stage library to use to connect to Google Cloud:
    - Google Cloud cluster-provided libraries - The cluster where the pipeline runs has the Google Cloud libraries installed, and therefore has all of the libraries needed to run the pipeline.
    - Google Cloud Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline. Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Google Cloud libraries for Hadoop.

    Select the appropriate version for your cluster.

    Note: When using additional Google Cloud or Google Big Query stages in the pipeline, ensure that they use the same stage library.
  - Cache Data - Caches data processed for a batch so the data can be reused by multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

    Available when Load Data Only Once is not enabled. When the origin loads data once, it caches data for the entire pipeline run.
  - Load Data Only Once - Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run.

    Select this property for lookup origins. When configuring lookup origins, do not limit the batch size; all lookup data should be read in a single batch.
  - Skip Offset Tracking - Skips tracking offsets. The origin reads all available data for each batch that the pipeline processes, limited by any batch-size configuration for the origin.
- On the Big Query tab, configure the following properties:
  - Dataset - Google BigQuery dataset.
  - Table - Google BigQuery table or view to read.

    Note: The name of the table or view can include ASCII alphanumeric characters and underscores (_). The name must not contain other special characters or spaces.
  - Columns to Read - Columns in the BigQuery table or view to read. By default, the origin reads all columns. Specify columns to reduce the number of columns included in the read.
  - Filter Condition - Condition to limit the query that the origin performs. Filter conditions are pushed down to BigQuery, except for conditions on nested fields. You can specify any condition that is valid in the WHERE clause of a BigQuery SQL query. Do not include WHERE in the condition.

    For example, to query rows with IDs greater than 1000 in an `id` column, enter `id > 1000`.
  - Incremental Mode - Enables the origin to run in incremental mode. When not enabled, the origin runs in full query mode.
  - Offset Column - Column used to track the progress of the read. The offset column must be of a supported type and should contain unique, incremental values with no null values; the origin does not process records with null offset values.

    Having an index on this column is strongly encouraged, since the underlying query uses an ORDER BY clause and inequality operators on this column.

    Required in incremental mode.
  - Initial Offset - Initial offset value to read from. Used when the pipeline starts for the first time, and after you reset the pipeline offsets.

    Required in incremental mode.
  - Max Readers - Maximum number of readers to use, which determines the number of partitions used during processing. The number of readers actually used depends on BigQuery limits and the volume of data in the table or view.

    Default is 0, which allows one reader for each 400 MB of data.
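For a sense of how Max Readers relates to partitioning, the following hedged PySpark sketch uses the open source spark-bigquery connector's maxParallelism option, which similarly caps the number of read partitions. Transformer manages this internally; the table name and value here are illustrative assumptions.

```python
# Hedged sketch relating Max Readers to partitioning, using the open source
# spark-bigquery connector's maxParallelism option; Transformer manages this
# internally. Table name and value are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (
    spark.read.format("bigquery")
    .option("table", "my_dataset.my_table")
    .option("maxParallelism", "8")   # cap on read partitions, like Max Readers
    .load()
)
print(df.rdd.getNumPartitions())     # partitions actually created may be fewer
```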