S3 Staging URI and Directory

To run pipelines on an EMR cluster, Transformer must store files on Amazon S3.

Transformer stores libraries in the following location: <S3 staging URI>/<staging directory>.

The location must exist before you start the pipeline. You define the location using the following pipeline configuration properties on the Cluster tab:
  • Staging Directory
  • S3 Staging URI

When a pipeline runs on an existing cluster, you might configure pipelines to use the same S3 staging URI and staging directory. This allows EMR to reuse the common files stored in that location. Pipelines that run on different clusters can also use the same staging locations as long as the pipelines are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging locations.

When a pipeline runs on a provisioned cluster, using the same location is best practice, but not required.

Note: If you have multiple instances of Transformer that are configured to use different library versions, you might specify a different S3 staging URI or staging directory to avoid using the same staging location. For example, if you have two 4.1.0 Transformers, each using a different Oracle JDBC driver. To allow each Transformer to use their own driver version, specify different staging locations for those pipelines.
Transformer stores the following files in the specified location:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
<S3 staging URI>/<staging_directory>/<Transformer version>
For example, say you use s3://mybucket as the S3 staging URI and the default /streamsets staging directory for a Transformer 4.1.0 pipeline. Then, Transformer stores the reusable files in the following location:
s3://mybucket/streamsets/4.1.0
Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
<S3 staging URI>/<staging_directory>/staging/<pipelineId>/<runId>
For example, say you use s3://mybucket as the S3 staging URI and the default /streamsets staging directory to run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a directory like the following:
s3://mybucket/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328