The Google BigQuery origin executes a query job and reads the result from Google BigQuery.
The origin submits the query that you define, and then Google BigQuery runs the query as an interactive query. When the query is complete, the origin reads the query results to generate records. The origin runs the query once and then the pipeline stops when it finishes reading all query results. If you start the pipeline again, the origin submits the query again.
When you configure the origin, you define the query to run using valid BigQuery standard SQL or legacy SQL syntax. By default, BigQuery writes all query results to a temporary, cached results table. You can choose to disable retrieving cached results and force BigQuery to compute the query result.
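For example, a standard SQL query against one of Google's public sample datasets might look like the following (the dataset, table, and column names here are only illustrative):

```sql
SELECT name, state, number
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = 'CA'
LIMIT 100
```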
You also define the project and credentials provider to use to connect to Google BigQuery. The origin can retrieve credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Before the Google BigQuery origin can execute a query job and read the results, it must pass credentials to Google BigQuery. Configure the origin to retrieve the credentials from the Google Application Default Credentials or from a Google Cloud service account credentials file.
When configured to use the Google Application Default Credentials, the origin checks for the credentials file defined in the GOOGLE_APPLICATION_CREDENTIALS environment variable. If the environment variable doesn't exist and Data Collector is running on a virtual machine (VM) in Google Cloud Platform (GCP), the origin uses the built-in service account associated with the virtual machine instance.
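The lookup order described above can be sketched as follows. This is a simplified illustration only; the real check is performed by the Google client libraries, not by code like this:

```python
import os

def resolve_credentials_source(running_on_gcp_vm: bool) -> str:
    """Simplified sketch of the Application Default Credentials lookup order."""
    path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if path:
        # The environment variable, when set, always wins.
        return f"credentials file: {path}"
    if running_on_gcp_vm:
        # On a GCP VM, fall back to the instance's built-in service account.
        return "built-in service account of the VM instance"
    raise RuntimeError("no Application Default Credentials available")

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/var/lib/sdc-resources/keyfile.json"
print(resolve_credentials_source(running_on_gcp_vm=True))
# -> credentials file: /var/lib/sdc-resources/keyfile.json

del os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
print(resolve_credentials_source(running_on_gcp_vm=True))
# -> built-in service account of the VM instance
```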
For more information about the default credentials, see Google Application Default Credentials in the Google Developer documentation.
Complete the following steps to define the credentials file in the environment variable:

1. Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download a credentials file. For example, with the gcloud tool:

   ```shell
   gcloud iam service-accounts create my-account
   gcloud iam service-accounts keys create key.json --iam-account=my-account@my-project.iam.gserviceaccount.com
   ```

2. Modify environment variables using the method required by your installation type. Set the environment variable as follows:

   ```shell
   export GOOGLE_APPLICATION_CREDENTIALS="/var/lib/sdc-resources/keyfile.json"
   ```
When configured to use the Google Cloud service account credentials file, the origin checks for the file defined in the origin properties.
Complete the following steps to use the service account credentials file:

1. Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see generating a service account credential in the Google Cloud Platform documentation.
2. Store the file on the Data Collector machine. As a best practice, store the file in the Data Collector resources directory, $SDC_RESOURCES.
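The steps above can be sketched in the shell as follows. All paths here are illustrative stand-ins, and the key file is faked rather than downloaded:

```shell
# Sketch: stage a downloaded service account key in the Data Collector
# resources directory. A real install defines $SDC_RESOURCES already;
# the fixed path below is a stand-in for this illustration.
SDC_RESOURCES="/tmp/sdc-resources-demo"
mkdir -p "$SDC_RESOURCES"
printf '{"type": "service_account"}' > /tmp/keyfile.json   # stand-in for the downloaded key
cp /tmp/keyfile.json "$SDC_RESOURCES/keyfile.json"
ls "$SDC_RESOURCES"
```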
The Google BigQuery origin converts the Google BigQuery data types to Data Collector data types.
The data types supported by Google BigQuery depend on whether you use standard SQL or legacy SQL syntax to define the query.
The following table lists the data types that Google BigQuery standard SQL supports and the Data Collector data types that the origin converts them to:
| BigQuery Standard SQL Data Type | Data Collector Data Type |
|---|---|
| Array | List |
| Boolean | Boolean |
| Bytes | Byte Array |
| Date | Date |
| Datetime | Datetime |
| Float | Double |
| Integer | Long |
| String | String |
| Struct | List-Map |
| Time | Datetime |
| Timestamp | Datetime |
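The standard SQL conversions above can be expressed as a simple lookup table, which can be handy when validating the field types a pipeline is expected to produce downstream:

```python
# Standard SQL -> Data Collector type conversions, per the table above.
BIGQUERY_STANDARD_SQL_TO_SDC = {
    "Array": "List",
    "Boolean": "Boolean",
    "Bytes": "Byte Array",
    "Date": "Date",
    "Datetime": "Datetime",
    "Float": "Double",
    "Integer": "Long",
    "String": "String",
    "Struct": "List-Map",
    "Time": "Datetime",
    "Timestamp": "Datetime",
}

print(BIGQUERY_STANDARD_SQL_TO_SDC["Struct"])  # -> List-Map
```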
The following table lists the data types that Google BigQuery legacy SQL supports and the Data Collector data types that the origin converts them to:
| BigQuery Legacy SQL Data Type | Data Collector Data Type |
|---|---|
| Boolean | Boolean |
| Bytes | Byte Array |
| Date | Date |
| Datetime | Datetime |
| Float | Double |
| Integer | Long |
| Record | List-Map |
| String | String |
| Time | Datetime |
| Timestamp | Datetime |
The Google BigQuery origin generates an event when a query completes successfully.

Google BigQuery origin events can be used in any logical way. For example:

- With the Email executor to send a custom email after receiving an event. For an example, see Case Study: Sending Email.
- With a destination to store information about completed queries. For an example, see Case Study: Event Storage.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event records generated by the Google BigQuery origin have the following event-related record header attributes:
| Record Header Attribute | Description |
|---|---|
| sdc.event.type | Event type. Uses the following type: big-query-success - Generated when the origin successfully completes a query. |
| sdc.event.version | An integer that indicates the version of the event record type. |
| sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The query-success event record includes the following fields:

| Field | Description |
|---|---|
| query | Query that completed successfully. |
| timestamp | Timestamp when the query completed. |
| row-count | Number of processed rows. |
| source-offset | Offset after the query completed. |
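Putting the two tables together, a query-success event record can be pictured as follows. The attribute and field names come from the tables above; the concrete values are invented for illustration:

```python
# Shape of a query-success event record. Record header attributes in
# Data Collector are strings; the values below are made up.
event_header_attributes = {
    "sdc.event.type": "big-query-success",
    "sdc.event.version": "1",
    "sdc.event.creation_timestamp": "1533827790000",
}
event_fields = {
    "query": "SELECT name FROM `my-project.my_dataset.my_table`",  # hypothetical query
    "timestamp": 1533827790000,
    "row-count": 42,
    "source-offset": 42,
}

print(event_header_attributes["sdc.event.type"])  # -> big-query-success
```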
Configure a Google BigQuery origin to execute a query job and read the result from Google BigQuery.