Google Big Query

The Google Big Query origin reads data from a Google BigQuery table. Use the origin in Dataproc cluster pipelines only.

When you configure the origin, you specify the dataset and table name. The origin reads the entire table by default. You can configure the origin to read only specified columns. You can also limit the rows that the origin reads by defining a filter condition to include in a WHERE clause.
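As a rough illustration of how these properties combine, the following Python sketch builds the kind of SELECT statement that the configuration describes. The function name `build_query`, the sample dataset and table names, and the exact SQL text are assumptions made for this example. The query that the origin actually issues may differ.

```python
def build_query(dataset, table, columns=None, filter_condition=None):
    """Build a SELECT statement from origin-style settings (illustrative only)."""
    # No column list means the entire table is read, which is the default.
    select_list = ", ".join(columns) if columns else "*"
    query = f"SELECT {select_list} FROM `{dataset}.{table}`"
    # The filter condition, when defined, is placed in a WHERE clause.
    if filter_condition:
        query += f" WHERE {filter_condition}"
    return query


# Hypothetical dataset "sales" and table "orders".
full_table = build_query("sales", "orders")
filtered = build_query("sales", "orders", ["id", "total"], "total > 100")
print(full_table)
print(filtered)
```

The first call corresponds to the default behavior of reading the entire table. The second corresponds to an origin configured with two columns and a filter condition.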

You indicate whether the origin runs in incremental mode or full query mode. When running in incremental mode, you define the offset column and initial offset.
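The difference between the two modes can be sketched in Python with an in-memory list standing in for the table. This is a simplified model, not the origin's implementation: the function name `read_incremental`, the sample rows, and the strict greater-than comparison against the saved offset are all assumptions made for illustration.

```python
def read_incremental(rows, offset_column, last_offset):
    """Return rows past the saved offset, plus the offset to save for the next run."""
    # Assumption: only rows with an offset value greater than the saved offset are read.
    new_rows = sorted(
        (row for row in rows if row[offset_column] > last_offset),
        key=lambda row: row[offset_column],
    )
    # Keep the previous offset when nothing new arrived.
    next_offset = new_rows[-1][offset_column] if new_rows else last_offset
    return new_rows, next_offset


# Hypothetical table with "id" as the offset column and 0 as the initial offset.
table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
first_run, offset = read_incremental(table, "id", 0)

# A row is added between pipeline runs.
table.append({"id": 3, "name": "c"})
second_run, offset = read_incremental(table, "id", offset)

# Full query mode, by contrast, would read every row in the table on each run.
full_run = list(table)
```

In this model, the second incremental run returns only the row added after the first run, while a full query run returns all three rows.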

You can specify the number of workers that the origin uses to read from BigQuery.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets.