GCS Staging URI

To run pipelines on a Dataproc cluster, Transformer must store files in a staging directory on Google Cloud Storage.

You can configure the root directory to use as the staging directory. The default staging directory is /streamsets.

Use the following guidelines for the GCS staging URI:
  • The staging location must exist before you start the pipeline. One way to verify this is shown in the sketch after this list.
  • When a pipeline runs on an existing cluster, configure all pipelines to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored there.
  • Pipelines that run on different clusters can share a staging directory as long as they are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging directories.
  • When a pipeline runs on a provisioned cluster, using the same staging directory for all pipelines is a best practice, but not required.
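
Because Cloud Storage has no true directories, "creating" the staging location generally means confirming that the bucket exists and, optionally, writing a placeholder object under the staging prefix. The following minimal sketch uses the google-cloud-storage Python client; the bucket name and prefix are illustrative, and your environment may require different credentials or project settings.

from google.cloud import storage

# Illustrative values -- replace with your own bucket and staging prefix.
BUCKET_NAME = "my-dataproc-staging-bucket"
STAGING_PREFIX = "streamsets"  # default Transformer staging directory

def ensure_staging_location(bucket_name: str, prefix: str) -> None:
    """Verify the bucket exists and mark the staging prefix with a placeholder object."""
    client = storage.Client()
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        raise RuntimeError(
            f"Bucket {bucket_name} does not exist; create it before starting the pipeline"
        )

    # Writing an empty object under the prefix makes the "directory" visible
    # in the Cloud Console and confirms that we have write access.
    placeholder = bucket.blob(f"{prefix}/.keep")
    if not placeholder.exists():
        placeholder.upload_from_string(b"")

if __name__ == "__main__":
    ensure_staging_location(BUCKET_NAME, STAGING_PREFIX)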
Transformer stores the following files in the staging directory:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
/<staging_directory>/<Transformer version>
For example, if you use the default staging directory with Transformer version 4.1.0, Transformer stores the reusable files in the following location:
/streamsets/4.1.0
Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
/<staging_directory>/staging/<pipelineId>/<runId>
For example, if you use the default staging directory and run a pipeline named KafkaToJDBC, Transformer stores the pipeline-specific files in a directory like the following:
/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
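
If you need to inspect what Transformer staged for a particular run, you can list the objects under the pipeline-specific prefix. This is a minimal sketch that assumes the same bucket as above and the default staging directory; the pipeline ID and run ID values are illustrative, taken from the example path shown earlier.

from google.cloud import storage

# Illustrative values mirroring the path pattern
# /<staging_directory>/staging/<pipelineId>/<runId>
BUCKET_NAME = "my-dataproc-staging-bucket"
STAGING_DIR = "streamsets"
PIPELINE_ID = "KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052"
RUN_ID = "run1557350076328"

def list_run_files(bucket_name: str, staging_dir: str, pipeline_id: str, run_id: str) -> list[str]:
    """Return the names of all objects staged for one pipeline run."""
    prefix = f"{staging_dir}/staging/{pipeline_id}/{run_id}/"
    client = storage.Client()
    return [blob.name for blob in client.list_blobs(bucket_name, prefix=prefix)]

if __name__ == "__main__":
    for name in list_run_files(BUCKET_NAME, STAGING_DIR, PIPELINE_ID, RUN_ID):
        print(name)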