Google BigQuery
Supported pipeline types:
|
The origin submits the query that you define, and then Google BigQuery runs the query as an interactive query. When the query is complete, the origin reads the query results to generate records. The origin runs the query once and then the pipeline stops when it finishes reading all query results. If you start the pipeline again, the origin submits the query again.
When you configure the origin, you define the query to run using valid BigQuery standard SQL or legacy SQL syntax. By default, BigQuery writes all query results to a temporary, cached results table. You can choose to disable retrieving cached results and force BigQuery to compute the query result.
You also define the project ID and credentials to use when connecting to Google BigQuery.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Credentials
When the Google BigQuery origin executes a query job and reads the result from Google BigQuery, the origin must pass credentials to Google BigQuery.
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
BigQuery Data Types
The Google BigQuery origin converts the Google BigQuery data types to Data Collector data types.
BigQuery Data Type | Data Collector Data Type |
---|---|
Boolean | Boolean |
Bytes | Byte Array |
Date | Date |
Datetime | Datetime |
Float | Double |
Integer | Long |
Numeric | Decimal |
String | String |
Time | Datetime |
Timestamp | Datetime |
Datetime Conversion
In Google BigQuery, the Datetime, Time, and Timestamp data types have microsecond precision, but the corresponding Datetime data type in Data Collector has millisecond precision. The conversion between data types results in some precision loss.
To preserve potentially lost precision during data type conversion, the Google Big Query
origin generates the bq.fullValue
field attribute that stores a string
containing the original value with microsecond precision. You can use the
record:fieldAttribute
or
record:fieldAttributeOrDefault
functions to access the information
in the attribute.
Generated Field Attribute | Description |
---|---|
bq.fullValue | Provides the original precision for Datetime, Time, and Timestamp fields. |
For more information about field attributes, see Field Attributes.
Event Generation
The Google BigQuery origin generates an event when a query completes successfully.
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With the Google BigQuery executor to run SQL queries each time the origin successfully completes a query.
-
With a destination to store information about completed queries.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Record
Event records generated by the Google BigQuery origin have the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following type:
|
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Query success
- The origin generates a query success event record when it completes processing the data returned from a query.
Configuring a Google BigQuery Origin
Configure a Google BigQuery origin to execute a query job and read the result from Google BigQuery.
-
In the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Produce Events Generates event records when events occur. Use for event handling. On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the BigQuery tab, configure the following
properties:
BigQuery Property Description Query SQL query to use for the query job. Write the query using valid BigQuery standard SQL or legacy SQL syntax. Do not include the
#legacySql
or#standardSql
prefix in the query. Instead, select or clear the Use Legacy SQL property to specify the SQL syntax type.Use Legacy SQL Specifies whether the query uses standard SQL or legacy SQL syntax. Clear to use standard SQL. Select to use legacy SQL.
Use Cached Query Results Determines whether Google BigQuery retrieves cached results if they are present. Select to retrieve cached results. Clear to disable retrieving cached results.
Use this option with caution.
Query Timeout (sec) Maximum number of seconds to wait for the query to finish. If the query fails to complete within the timeout, the origin aborts the query and the pipeline fails. Enter a time in seconds or use the MINUTES or HOURS constant in an expression to define the time increment.
Default is five minutes, defined as follows: ${5 * MINUTES}.
Max Batch Size (records) Maximum number of records to include in a batch. -
On the Credentials tab, configure the following
properties:
Credentials Property Description Project ID Google Cloud project ID to use.
Credentials Provider Provider for Google Cloud credentials: - Default credentials provider - Uses Google Cloud default credentials.
- Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
- Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
Credentials File Path (JSON) Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file. Enter a path relative to the Data Collector resources directory,
$SDC_RESOURCES
, or enter an absolute path.Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.