GCS Staging URI
To run pipelines on a Dataproc cluster, Transformer must store files in a staging directory on Google Cloud Storage.
You can configure the root directory to use as the staging directory. The default staging directory is /streamsets.
Use the following guidelines for the GCS staging URI:
- The location must exist before you start the pipeline. You can create it ahead of time, as shown in the sketch after this list.
- When pipelines run on an existing cluster, configure them to use the same staging directory so that each Spark job created within Dataproc can reuse the common files stored there.
- Pipelines that run on different clusters can use the same staging directory as long as they are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging directories.
- When a pipeline runs on a provisioned cluster, using the same staging directory across pipelines is a best practice, but not required.
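One way to make sure the staging location exists before starting a pipeline is to create the bucket with the Cloud Storage client library. The following is a minimal sketch, not part of the product: the project, bucket name, and region are hypothetical placeholders.

```python
from google.cloud import storage

# Hypothetical names; substitute your own project, bucket, and region.
PROJECT = "my-project"
BUCKET = "my-transformer-staging"

client = storage.Client(project=PROJECT)
bucket = client.bucket(BUCKET)

# Cloud Storage has no true directories: once the bucket exists, the staging
# directory (for example, gs://my-transformer-staging/streamsets) is simply an
# object-name prefix under it.
if not bucket.exists():
    client.create_bucket(BUCKET, location="us-central1")
```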
Transformer stores the following files in the staging directory:
- Files that can be reused across pipelines - Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in a common location within the staging directory.
- Files specific to each pipeline - Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in a pipeline-specific directory within the staging directory.
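To see what Transformer has staged, you can list the objects under the configured root. A minimal sketch, again using a placeholder bucket name and assuming the default /streamsets root:

```python
from google.cloud import storage

client = storage.Client()

# Shared libraries and pipeline-specific files both appear as object-name
# prefixes under the configured staging root.
for blob in client.list_blobs("my-transformer-staging", prefix="streamsets/"):
    print(f"{blob.name} ({blob.size} bytes)")
```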