Google BigQuery
The Google BigQuery executor runs one or more SQL queries on Google BigQuery each time it receives an event record. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.
Use the executor as part of an event stream in the pipeline. For example, you might use the Google BigQuery executor to execute a stored procedure in the database when the pipeline generates a pipeline stop event.
When you configure the executor, you specify authentication information for Google BigQuery. You can optionally configure the executor to connect to BigQuery through a proxy server. You can also use a connection to configure the destination.
You specify one or more SQL queries to run and how to submit the queries. You can also configure the executor to generate events for another event stream, and whether you want a query result count included in generated event records. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Before you use the Google BigQuery executor, you must complete a prerequisite task.
Prerequisite
Executing a Google BigQuery SQL query requires that the user specified in the Google BigQuery executor has the appropriate BigQuery permissions.
- bigquery.jobs.create
The user might require additional permissions depending on the types of SQL queries specified in the executor. For information about permissions needed for different types of queries, see the Google BigQuery documentation.
Credentials
When the Google BigQuery executor connects to BigQuery, the executor must pass credentials to Google Cloud Storage and then to Google BigQuery.
- Google Cloud default credentials
- Credentials in a file
- Credentials in a stage property
For details on how to configure each option, see Security in Google Cloud Stages.
SQL Queries
You can specify one or more queries to perform each time that the Google BigQuery executor receives an event record.
When a query fails, the Google BigQuery executor treats the event record that triggered the query like an error record. If you specify more than one query and multiple queries fail for an event record, the executor creates an error record for each failed query, and includes query failure details in the header attributes for the error record.
- Query submission
- Configure the Query Submission property to define how the executor submits
your queries:
- Sequential - For each incoming event record, the executor submits one query at a time, and waits until the previous query is complete before submitting the next query. Use when the run order for the queries is important. Queries are submitted in the order that they appear in the executor.
- Parallel - For each incoming event record, the executor submits all queries at the same time. Use when the run order for the queries is not important.
- Expressions in queries
- You can include a subset of the functions provided with the StreamSets expression language in a SQL query. These expressions are evaluated before the executor passes the query to BigQuery.
Event Generation
The Google BigQuery executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events for each successful or failed query.
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following event types:
|
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- Successful query
-
The executor generates a successful-query event record after successfully completing a query.
Successful-query event records have thesdc.event.type
record header attribute set tosucessful-query
and include the following fields:Event Field Name Description query Query completed. query-result Number of rows affected by query. Included if the Query Result Count in Events property is selected. - Failed query
-
The executor generates a failed-query event record after failing to complete a query.
Failed-query event records have thesdc.event.type
record header attribute set tofailed-query
and include the following field:Event Field Name Description query Query attempted.
Configuring a Google BigQuery Executor
-
In the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Produce Events Generates event records when events occur. Use for event handling. Required Fields Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses.Records that do not include all required fields are processed based on the error handling configured for the pipeline.
Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the Google BigQuery tab, configure the following
properties:
Google BigQuery Property Description Connection Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
Project ID Google Cloud project ID to use.
Credentials Provider Provider for Google Cloud credentials: - Default credentials provider - Uses Google Cloud default credentials.
- Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
- Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
Credentials File Path (JSON) Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file. Enter a path relative to the Data Collector resources directory,
$SDC_RESOURCES
, or enter an absolute path.Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect. Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store. For more information about credential stores, see Credential Stores in the Data Collector documentation.
Connection Pool Size Maximum number of connections that the executor uses to connect to Google BigQuery. Default is 0, which ensures that the destination uses the same number of connections as threads used by the pipeline. When loading to multiple tables, increasing this property can improve performance.
Enable Proxy Connects to the Google BigQuery API through an HTTP proxy server. Proxy URL URL of the HTTP proxy server in the following format: http://<host>:<port>
Proxy Username Optional user name to authenticate with the proxy server. Proxy Password Optional password to authenticate with the proxy server. Use Legacy SQL Enables using Google BigQuery legacy SQL instead of standard SQL in SQL queries. Use Cached Query Results Uses the query results cached on Google Cloud Storage. Use this option with caution.
-
On the Queries tab, configure the following
properties:
Queries Property Description SQL Queries One or more SQL queries to execute upon receiving an event record. By default, queries are executed in the specified order. To add additional queries, click Add.
Query Submission Determines how the executor submits multiple queries: - Sequential - For each incoming event record, the executor submits one query at a time, and waits until the previous query is complete before submitting the next query. Use when the run order for the queries is important. Queries are submitted in the order that they appear in the executor.
- Parallel - For each incoming event record, the executor submits all queries at the same time. Use when the run order for the queries is not important.
By default, the executor waits for a query to complete before submitting the next query.
Include Query Result Count in Events When the executor is configured to generate events, the executor includes query result counts in event records. Query Timeout (sec) Seconds to wait for the results of the query.