Google BigQuery

The Google BigQuery executor runs one or more SQL queries on Google BigQuery each time it receives an event record. For information about supported versions, see Supported Systems and Versions.

Use the executor as part of an event stream in the pipeline. For example, you might use the Google BigQuery executor to execute a stored procedure in the database when the pipeline generates a pipeline stop event.

When you configure the executor, you specify authentication information for Google BigQuery. You can optionally configure the executor to connect to BigQuery through a proxy server.

You specify one or more SQL queries to run and how to submit the queries. You can also configure the executor to generate events for another event stream, and whether you want a query result count included in generated event records. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Before you use the Google BigQuery executor, you must complete a prerequisite task.

Prerequisite

Executing a Google BigQuery SQL query requires that the user specified in the Google BigQuery executor has the appropriate BigQuery permissions.

Ensure that the specified user has the following permission:
  • bigquery.jobs.create

The user might require additional permissions depending on the types of SQL queries specified in the executor. For information about permissions needed for different types of queries, see the Google BigQuery documentation.

Credentials

When the Google BigQuery executor connects to BigQuery, the executor must pass credentials to Google Cloud Storage and then to Google BigQuery.

You can provide credentials using one the following options:
  • Google Cloud default credentials
  • Credentials in a file
  • Credentials in a stage property

For details on how to configure each option, see Security in Google Cloud Stages.

SQL Queries

You can specify one or more queries to perform each time that the Google BigQuery executor receives an event record.

When a query fails, the Google BigQuery executor treats the event record that triggered the query like an error record. If you specify more than one query and multiple queries fail for an event record, the executor creates an error record for each failed query, and includes query failure details in the header attributes for the error record.

Use the following details to configure the SQL queries:
Query submission
Configure the Query Submission property to define how the executor submits your queries:
  • Sequential - For each incoming event record, the executor submits one query at a time, and waits until the previous query is complete before submitting the next query. Use when the run order for the queries is important. Queries are submitted in the order that they appear in the executor.
  • Parallel - For each incoming event record, the executor submits all queries at the same time. Use when the run order for the queries is not important.
Expressions in queries
You can include a subset of the functions provided with the StreamSets expression language in a SQL query. These expressions are evaluated before the executor passes the query to BigQuery.

When entering the query in the executor, use expression completion to view the list of valid functions that you can use.

Event Generation

The Google BigQuery executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events for each successful or failed query.

Executor events can be used in any logical way. For example:

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the Google BigQuery executor have the following event-related record header attributes. Record header attributes are stored as String values.
Record Header Attribute Description
sdc.event.type Event type. Uses the following event types:
  • successful-query - Generated after a query successfully completes.
  • failed-query - Generated after a query fails.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
The Google BigQuery executor can generate the following types of event records:
Successful query

The executor generates a successful-query event record after successfully completing a query.

Successful-query event records have the sdc.event.type record header attribute set to sucessful-query and include the following fields:
Event Field Name Description
query Query completed.
query-result Number of rows affected by query. Included if the Query Result Count in Events property is selected.
Failed query

The executor generates a failed-query event record after failing to complete a query.

Failed-query event records have the sdc.event.type record header attribute set to failed-query and include the following field:
Event Field Name Description
query Query attempted.

Configuring a Google BigQuery Executor

Configure a Google BigQuery executor to run a SQL query on Google BigQuery upon receiving an event. Before you use the executor in a pipeline, complete the prerequisite task.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Google BigQuery tab, configure the following properties:
    Google BigQuery Property Description
    Project ID

    Google Cloud project ID to use.

    Credentials Provider Provider for Google Cloud credentials:
    • Default credentials provider - Uses Google Cloud default credentials.
    • Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
    • Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.
    Credentials File Path (JSON) Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file.

    Enter a path relative to the Data Collector resources directory, $SDC_RESOURCES, or enter an absolute path.

    Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect.

    Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.

    Connection Pool Size Maximum number of connections that the executor uses to connect to Google BigQuery. Default is 0, which ensures that the destination uses the same number of connections as threads used by the pipeline.

    When loading to multiple tables, increasing this property can improve performance.

    Enable Proxy Connects to the Google BigQuery API through an HTTP proxy server.
    Proxy URL URL of the HTTP proxy server in the following format:

    http://<host>:<port>

    Proxy Username Optional user name to authenticate with the proxy server.
    Proxy Password Optional password to authenticate with the proxy server.
    Use Legacy SQL Enables using Google BigQuery legacy SQL instead of standard SQL in SQL queries.
    Use Cached Query Results Uses the query results cached on Google Cloud Storage.

    Use this option with caution.

  3. On the Queries tab, configure the following properties:
    Queries Property Description
    SQL Queries One or more SQL queries to execute upon receiving an event record. By default, queries are executed in the specified order.

    To add additional queries, click Add.

    Query Submission Determines how the executor submits multiple queries:
    • Sequential - For each incoming event record, the executor submits one query at a time, and waits until the previous query is complete before submitting the next query. Use when the run order for the queries is important. Queries are submitted in the order that they appear in the executor.
    • Parallel - For each incoming event record, the executor submits all queries at the same time. Use when the run order for the queries is not important.

    By default, the executor waits for a query to complete before submitting the next query.

    Include Query Result Count in Events When the executor is configured to generate events, the executor includes query result counts in event records.
    Query Timeout (sec) Seconds to wait for the results of the query.