Google Cloud

You can create the following types of Google Cloud connections:

Google BigQuery Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create a Google BigQuery connection, the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, must be installed on the selected authoring Data Collector.

For a description of the Google BigQuery connection properties, see Google Cloud Connection Properties.

After you create a Google BigQuery connection, you can use the connection in the following stages:

Engine: Data Collector 5.3.0 or later
Stages:
  • Google BigQuery origin
  • Google BigQuery destination
  • Google BigQuery executor

Engine: Data Collector 4.0.0 to 5.2.x
Stages:
  • Google BigQuery origin
  • Google BigQuery destination
  • Google BigQuery executor - Requires Google Enterprise stage library 1.1.0 or later.

Google Cloud Storage Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create a Google Cloud Storage connection, the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, must be installed on the selected authoring Data Collector.

For a description of the Google Cloud Storage connection properties, see Google Cloud Connection Properties.

After you create a Google Cloud Storage connection, you can use the connection in the following stages and locations:

Engine: Data Collector 4.1.0 or later
Stages and locations:
  • Google Cloud Storage executor

Engine: Data Collector 4.0.0 or later
Stages and locations:
  • Google Cloud Storage origin
  • Google Cloud Storage destination
  • Write to Google Cloud Storage error record handling configured for a pipeline
  • Google Cloud Storage staging location for the Google BigQuery destination

Google Pub/Sub Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create a Google Pub/Sub connection, the Google Cloud stage library, streamsets-datacollector-google-cloud-lib, must be installed on the selected authoring Data Collector.

For a description of the Google Pub/Sub connection properties, see Google Cloud Connection Properties.

After you create a Google Pub/Sub connection, you can use the connection in the following stages and locations:

Engine: Data Collector 4.0.0 or later
Stages and locations:
  • Google Pub/Sub Subscriber origin
  • Google Pub/Sub Publisher destination
  • Write to Google Pub/Sub error record handling configured for a pipeline

Google Cloud Credentials

Each Google Cloud connection must pass credentials to Google Cloud.

You can provide credentials using one of the following options:
  • Google Cloud default credentials
  • Credentials in a file
  • Credentials in a connection property

Default Credentials

You can configure the connection to use Google Cloud default credentials. When using Google Cloud default credentials, the pipeline checks for the credentials file defined in the GOOGLE_APPLICATION_CREDENTIALS environment variable.

When using a self-managed deployment, set the environment variable on each Data Collector machine. When using a GCE deployment, the deployment must use an instance service account with access to Google Secret Manager. You cannot use the default credentials mode with other deployment types.

For more information about the default credentials, see Finding credentials automatically in the Google Cloud documentation.

Complete the following steps to define the credentials file in the environment variable:
  1. Use the Google Cloud Platform Console or the gcloud command-line tool to create a Google service account and have your application use it for API access.
    For example, to use the gcloud command-line tool, run the following commands:
    gcloud iam service-accounts create my-account
    gcloud iam service-accounts keys create key.json --iam-account=my-account@my-project.iam.gserviceaccount.com
  2. Store the generated credentials file in a local directory external to the Data Collector installation directory.
    For example, if you installed Data Collector in the following directory:
    /opt/sdc/
    you might store the credentials file at:
    /opt/sdc-credentials
    Important: The file must exist in the same location on all execution engines that access the connection.
  3. For all registered Data Collectors that access the connection, add the GOOGLE_APPLICATION_CREDENTIALS environment variable to the appropriate file and point it to the credentials file.

    Modify environment variables using the method required by your installation type.

    Set the environment variable to the full path of the credentials file on that machine. For example:
    export GOOGLE_APPLICATION_CREDENTIALS="/var/lib/sdc-resources/keyfile.json"
  4. Restart each Data Collector to enable the changes.
  5. On the Credentials tab for the connection, for the Credentials Provider property, select Default Credentials Provider.
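
After completing these steps, it can help to confirm that the variable is set and points to a readable file. The following is a minimal sketch and not part of Data Collector; the function name check_default_credentials is hypothetical. Run it in a shell that has the same environment and user as the Data Collector process.

```shell
# Hypothetical helper: verify that GOOGLE_APPLICATION_CREDENTIALS is set
# and points to a readable file. Returns 0 on success, 1 otherwise.
check_default_credentials() {
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
    return 1
  fi
  if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "Cannot read $GOOGLE_APPLICATION_CREDENTIALS" >&2
    return 1
  fi
  echo "Credentials file found: $GOOGLE_APPLICATION_CREDENTIALS"
}
```

The check only confirms that the file is present and readable. It does not validate the credentials against Google Cloud.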

Credentials in a File

You can configure the connection to use credentials in a Google Cloud service account credentials JSON file.

Complete the following steps to use credentials in a file:

  1. Generate a service account credentials file in JSON format.

    Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see Generating a service account credential in the Google Cloud Platform documentation.

  2. Store the generated credentials file on the Data Collector machine.
    As a best practice, store the file in the Data Collector resources directory, $SDC_RESOURCES.
    Important: The file must exist in the same location on all execution engines that access the connection.
  3. On the Credentials tab for the connection, for the Credentials Provider property, select Service Account Credentials File. Then, enter the path to the credentials file.
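
Before you enter the path in the connection, you might confirm that the stored file is a service account key. The following is a minimal sketch, assuming the key file contains the usual "type": "service_account" entry. The function name is_service_account_key and the file name keyfile.json are hypothetical, and the check is a plain text match rather than full JSON validation.

```shell
# Hypothetical helper: succeed only if the given file is readable and
# contains a "type": "service_account" entry. This is a text match,
# not a JSON parser.
is_service_account_key() {
  [ -r "$1" ] &&
    grep -q '"type"[[:space:]]*:[[:space:]]*"service_account"' "$1"
}

# Example call (keyfile.json is a placeholder name):
#   is_service_account_key "$SDC_RESOURCES/keyfile.json" && echo "ok"
```

Running the same check on each execution engine is one way to verify that the file exists in the same location everywhere.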

Credentials in a Connection Property

You can configure the connection to use credentials specified in a connection property. When using credentials in connection properties, you provide JSON-formatted credentials from a Google Cloud service account credential file.

You can enter credential details in plain text, but best practice is to secure the credential details using credential stores or runtime resources.

Complete the following steps to use credentials specified in connection properties:
  1. Generate a service account credentials file in JSON format.

    Use the Google Cloud Platform Console or the gcloud command-line tool to generate and download the credentials file. For more information, see Generating a service account credential in the Google Cloud Platform documentation.

  2. As a best practice, secure the credentials using credential stores or runtime resources.
  3. On the Credentials tab for the connection, for the Credentials Provider property, select Service Account Credentials. Then, enter the JSON-formatted credential details or an expression that calls the credentials from a credential store.
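
Because the connection property holds the JSON content rather than a path, it can be worth confirming which service account a key file belongs to before you copy its contents or load them into a credential store. The following is a minimal sketch, assuming the key file has the standard client_email field on a single line. The function name service_account_email is hypothetical.

```shell
# Hypothetical helper: print the client_email value from a service
# account key file so you can confirm which account the JSON belongs to.
# Uses a simple text match and assumes the field sits on a single line.
service_account_email() {
  sed -n 's/.*"client_email"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

If the function prints nothing, the file is probably not a service account key in the expected layout.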

Google Cloud Connection Properties

You configure similar properties for all of the Google Cloud connections.

When creating one of the Google Cloud connections, configure the following properties on the Credentials tab:

Project ID
    Google Cloud project ID to use.

Credentials Provider
    Provider for Google Cloud credentials:
      • Default credentials provider - Uses Google Cloud default credentials.
      • Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
      • Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.

Credentials File Path (JSON)
    Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file.
    Important: The file must exist in the same location on all execution engines that access the connection.
    Enter a path relative to the Data Collector resources directory, $SDC_RESOURCES, or enter an absolute path.

Credentials File Content (JSON)
    Contents of a Google Cloud service account credentials JSON file used to connect.
    Enter JSON-formatted credential information in plain text, or use an expression to call the information from credential stores or runtime resources.
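
The Credentials File Path (JSON) property accepts either an absolute path or a path relative to $SDC_RESOURCES. The following is a minimal sketch of that rule, useful for checking where a given value points on a particular machine. It is an illustration rather than the Data Collector implementation, and the function name resolve_credentials_path is hypothetical.

```shell
# Hypothetical helper: resolve a Credentials File Path value as the
# property description states. Absolute paths are used as-is, and
# relative paths are taken relative to $SDC_RESOURCES.
resolve_credentials_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)
      # Fail with a message if SDC_RESOURCES is not set.
      : "${SDC_RESOURCES:?SDC_RESOURCES is not set}"
      # Strip one trailing slash, if present, before joining.
      printf '%s\n' "${SDC_RESOURCES%/}/$1"
      ;;
  esac
}
```

For example, with SDC_RESOURCES set to /var/lib/sdc-resources, the value keyfile.json resolves to /var/lib/sdc-resources/keyfile.json.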