Staging Directory

To run pipelines on a Databricks cluster, Transformer must store files in a staging directory on Databricks File System (DBFS).

You can configure the root directory to use as the staging directory. The default staging directory is /streamsets.

When a pipeline runs on an existing interactive cluster, configure pipelines to use the same staging directory so that each job created within Databricks can reuse the common files stored in the directory. Pipelines that run on different clusters can use the same staging directory as long as the pipelines are started by the same Transformer instance. Pipelines that are started by different instances of Transformer must use different staging directories. Different Transformer instances cannot send pipelines to the same cluster.

When a pipeline runs on a provisioned job cluster, using the same staging directory for pipelines is best practice, but not required.

Transformer stores the following files in the staging directory:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
/<staging_directory>/<Transformer version>
For example, say you use the default staging directory for Transformer version 4.1.0. Then, Transformer stores the reusable files in the following location:
/streamsets/4.1.0
Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
/<staging_directory>/staging/<pipelineId>/<runId>
For example, say you use the default staging directory and run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a directory like the following:
/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328