To run pipelines on an EMR cluster, Transformer must store files on Amazon S3.
Transformer stores libraries in the following location: <S3 staging URI>/<staging directory>
The location must exist before you start the pipeline. You define the
location using the following pipeline configuration properties on the Cluster tab:
- Staging Directory
- S3 Staging URI
When pipelines run on an existing cluster, you might configure them to use the same S3 staging URI and staging directory. This allows EMR to reuse the common files stored in that location. Pipelines that run on different clusters can also use the same staging location, as long as the pipelines are started by the same Transformer instance. Pipelines started by different Transformer instances must use different staging locations.
When a pipeline runs on a provisioned cluster, using the same location is a best practice, but not required.
Note: If you have multiple instances of Transformer that are configured to use different library versions, you might specify a different S3 staging URI or staging directory for each so that they do not share a staging location. For example, say you have two Transformer 4.1.0 instances, each using a different Oracle JDBC driver. To allow each Transformer to use its own driver version, specify different staging locations for the pipelines started by each instance.
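The sharing rule above can be checked mechanically before pipelines are started. The following is a minimal Python sketch, not part of Transformer: the function names, the tuple layout, and the instance labels such as "tx-1" are hypothetical and used only for illustration. It assumes the staging location is the S3 staging URI joined with the staging directory, and it reports any location that is used by more than one Transformer instance.

```python
from collections import defaultdict


def staging_location(s3_staging_uri, staging_directory="/streamsets"):
    """Join the two Cluster tab properties into one location string."""
    return s3_staging_uri.rstrip("/") + "/" + staging_directory.strip("/")


def find_shared_locations(pipelines):
    """Return staging locations used by more than one Transformer instance.

    `pipelines` is an iterable of hypothetical tuples:
    (pipeline_name, transformer_instance, s3_staging_uri, staging_directory)
    """
    instances_by_location = defaultdict(set)
    for _name, instance, uri, directory in pipelines:
        instances_by_location[staging_location(uri, directory)].add(instance)
    # Sharing within one instance is allowed; flag only cross-instance sharing.
    return {
        location: sorted(instances)
        for location, instances in instances_by_location.items()
        if len(instances) > 1
    }


if __name__ == "__main__":
    example = [
        ("ingest", "tx-1", "s3://mybucket", "/streamsets"),
        ("report", "tx-1", "s3://mybucket", "/streamsets"),  # same instance: fine
        ("export", "tx-2", "s3://mybucket", "/streamsets"),  # different instance: conflict
    ]
    print(find_shared_locations(example))
```

An empty result means no two Transformer instances share a staging location. The check does not verify that the location exists on S3.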
Transformer stores the following files in the specified location:
- Files that can be reused across pipelines
  Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
  <S3 staging URI>/<staging directory>/<Transformer version>
  For example, say you use s3://mybucket as the S3 staging URI and the default /streamsets staging directory for a Transformer 4.1.0 pipeline. Then, Transformer stores the reusable files in the following location:
  s3://mybucket/streamsets/4.1.0
- Files specific to each pipeline
  Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
  <S3 staging URI>/<staging directory>/staging/<pipelineId>/<runId>
  For example, say you use s3://mybucket as the S3 staging URI and the default /streamsets staging directory to run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a directory like the following:
  s3://mybucket/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
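The two path patterns above can be illustrated with a short Python sketch. It is for illustration only and does not call any Transformer or AWS API. The function names are hypothetical, and the pipeline ID and run ID are copied from the example above rather than generated.

```python
def _join(first, *rest):
    """Join path parts with single slashes, keeping the s3:// scheme intact."""
    return "/".join([first.rstrip("/")] + [part.strip("/") for part in rest])


def reusable_files_location(s3_staging_uri, transformer_version,
                            staging_directory="/streamsets"):
    """<S3 staging URI>/<staging directory>/<Transformer version>"""
    return _join(s3_staging_uri, staging_directory, transformer_version)


def pipeline_files_location(s3_staging_uri, pipeline_id, run_id,
                            staging_directory="/streamsets"):
    """<S3 staging URI>/<staging directory>/staging/<pipelineId>/<runId>"""
    return _join(s3_staging_uri, staging_directory, "staging", pipeline_id, run_id)


if __name__ == "__main__":
    print(reusable_files_location("s3://mybucket", "4.1.0"))
    # s3://mybucket/streamsets/4.1.0
    print(pipeline_files_location(
        "s3://mybucket",
        "KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052",
        "run1557350076328",
    ))
    # s3://mybucket/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328
```

Both functions default to the /streamsets staging directory named in the text. Pass a different `staging_directory` to model a custom setting.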