Amazon EMR Serverless

You can run Transformer pipelines on an Amazon EMR Serverless application. An EMR Serverless application uses a framework based on a version of Amazon EMR and a Spark runtime application. In Transformer, you configure an Amazon EMR Serverless application as a cluster manager.

Pipelines can use an existing EMR Serverless application or create a new one. Creating an application that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline. Running multiple pipelines on a single existing application can also reduce costs. Each application uses a specific version of Amazon EMR. For the supported versions, see Cluster Compatibility Matrix.

EMR Serverless applications run in a virtual private cloud (VPC). To use an EMR Serverless application, you must install Transformer in that VPC or in an accessible location. The application must also be able to access the origin and destination systems configured in the pipeline. To configure pipelines that create new applications, you need a VPC that can run the applications. You must specify subnet IDs from the VPC and security group IDs needed for communication.

Each pipeline has a staging location to store needed files. Pipelines started from the same Transformer instance can share a staging location, allowing Transformer to reuse the common files stored in that location.

To run a pipeline on an EMR Serverless application, configure the pipeline to use EMR Serverless as the cluster manager type, and then configure the needed properties. These properties specify the following:
  • Authentication methods
  • Information about an existing EMR Serverless application or details to define a new one
  • Amazon S3 bucket and path for storage
  • Path to the staging location
  • Retry parameters for failed attempts

You can also use a connection to configure the pipeline.

The following image shows part of the Cluster tab of a pipeline configured to run on an EMR Serverless application:

VPC Access

Each EMR Serverless application resides in a virtual private cloud (VPC). You must install Transformer in that VPC or in a location that the VPC can access. Similarly, the origin and destination systems configured in the pipeline must be accessible from the VPC.

For pipelines that create new EMR Serverless applications, you need a VPC for the applications. You can use an existing VPC or create a new one. In the pipeline configuration, you specify the VPC subnet IDs that contain Transformer and the origin and destination systems configured in the pipeline. You also specify the security group IDs needed to communicate with Transformer and the origin and destination systems.

For pipelines that use an existing EMR Serverless application, you might need to update the application to select the subnets and security groups needed to access Transformer and the origin and destination systems configured in the pipeline.

StreamSets recommends installing Transformer on an Amazon EC2 instance. You can do one of the following:
  • Set up a self-managed deployment and launch Transformer on an Amazon EC2 instance in your VPC.
  • Set up an Amazon EC2 deployment that automatically provisions EC2 instances in your VPC and then launches a Transformer engine on each instance.

For information about VPCs, see the Amazon VPC documentation. For information about configuring VPC access for an EMR Serverless application, see the Amazon EMR Serverless documentation.

Authentication Method

You configure the authentication method that Transformer uses to connect to an EMR Serverless application. When you start the pipeline, Transformer uses the specified instance profile or AWS access key to launch the Spark application.

Transformer can authenticate in the following ways:
Instance profile
When Transformer runs on an Amazon EC2 instance that has an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS.
For more information about associating an instance profile with an EC2 instance, see the Amazon EC2 documentation.
AWS access keys
When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
None
You can optionally choose to connect anonymously using no authentication.

You can also temporarily assume a specified role to connect to the Amazon EMR Serverless application. For more information, see Assume Another Role.

Existing Application

You can configure a pipeline to run on an existing EMR Serverless application.

To configure the pipeline, click the Cluster tab on the pipeline properties, clear the Create a New Application property, and then specify the application to use. You can specify an application by application ID or by application name and optional tags.

Transformer and the origin and destination systems configured in the pipeline must be accessible from the VPC where the specified application resides. You might need to update the application to select the subnets and security groups needed to access Transformer and the origin and destination systems. For information about configuring VPC access for an EMR Serverless application, see the Amazon EMR Serverless documentation.

You can configure all pipelines that run on an existing application to use the same staging location. With this configuration, EMR Serverless can reuse the common files stored in that location.

Tip: When feasible, running multiple pipelines on a single existing application can be a cost-reducing measure.

For information about interacting with an EMR Serverless application, see the Amazon EMR Serverless documentation.

Specifying an Application

When you configure pipeline properties for an existing EMR Serverless application, you specify the application to use by application ID, by default. If you prefer, you can specify the application using the application name and optional tags.

The cluster name property is case-sensitive. If your EMR account has access to more than one application by the same name, define tags to differentiate between the applications.

Transformer searches for applications that match the specified name and tags and that are not in Terminal state. If it finds more than one matching applications, Transformer randomly selects one to use.
Tip: If you have multiple applications of the same name with the same tags, to ensure that Transformer runs a pipeline on a particular application, specify the application by application ID.

To specify the application by ID, on the Cluster tab, enter the application ID in the Application ID property.

To specify the application by name and tags, on the Cluster tab, enable the Application by Name and Tags property. Then, specify a case-sensitive application name in the EMR Application Name property.

To add optional tags, in the EMR Application Tags property, click Add, then define a tag name and value for each tag that you want to define.

Newly Created Application

You can configure a pipeline to run on a newly created EMR Serverless application. When configured, Transformer creates the application upon the initial run of a pipeline. You can configure Transformer to terminate the application after the pipeline stops.

Tip: Creating an application that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline.

Before you configure a pipeline to run on a new application, you need a VPC for the applications. You can use an existing VPC or create a new one. The pipeline configuration requires you to specify subnet and security group IDs from the VPC. For information about VPCs, see the Amazon VPC documentation.

To configure a pipeline to run on a new application, click the Cluster tab on the pipeline properties, select the Create a New Application property, and configure the application properties. You specify details, such as the runtime role, the EMR version, IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline, IDs of the security groups that can access Transformer and the origin and destination systems configured in the pipeline, and the maximum size of the application. You also indicate whether to stop the application after the pipeline stops.

You can also configure the pipeline to save log data to a different location to avoid losing that data when the application terminates.

For a full list of properties, see Configuring a Pipeline.

For more information about applications, see the Amazon EMR Serverless documentation.

Staging Location

Pipelines that run on an EMR Serverless application store files in an intermediate Amazon S3 location.

To determine the intermediate location, Transformer concatenates the values from two properties on the Cluster tab, S3 Staging URI and Staging Directory, as follows:
<S3 staging URI>/<staging directory>

The location must exist before you start the pipeline.

Pipelines that run from the same Transformer instance can use the same staging location. With the same staging location configured, Transformer can reuse the common files stored in that location. However, pipelines started by different Transformer instances must use different staging locations.

Note: If you have multiple instances of Transformer that are configured to use different library versions, you must specify a different S3 staging URI or staging directory to avoid using the same staging location. For example, suppose you have two 5.6.0 instances of Transformer, each using a different Oracle JDBC driver. To allow each Transformer instance to use its own driver version, specify different staging locations for those pipelines.
Transformer stores the following files in specific locations:
Files that can be reused across pipelines
Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
<S3 staging URI>/<staging_directory>/<Transformer version>
For example, say you configure s3://mybucket as the S3 staging URI with the default /streamsets staging directory for a Transformer 5.6.0 pipeline. Then, Transformer stores the reusable files in the following location:

s3://mybucket/streamsets/5.6.0

Files specific to each pipeline
Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
<S3 staging URI>/<staging_directory>/staging/<pipelineId>/<runId>
For example, say you configure s3://mybucket as the S3 staging URI with the default /streamsets staging directory to run a pipeline named KafkaToJDBC. Transformer stores pipeline-specific files in a location like the following:

s3://mybucket/streamsets/staging/KafkaToJDBC03a0d2cc-f622-4a68-b161-7f2d9a4f3052/run1557350076328

Monitoring EMR Serverless Pipelines

When Transformer cannot access monitoring information for an EMR Serverless application, Transformer tries to retrieve that information again based on the following Transformer configuration properties:
  • transformer.emr.monitoring.max.retry - Defines how many times Transformer retries retrieving monitoring information. Default is 3.
  • transformer.emr.monitoring.retry.base.backoff - Defines the minimum number of seconds to wait between retries. Default is 15.
  • transformer.emr.monitoring.retry.max.backoff - Defines the maximum number of seconds to wait between retries. Default is 300, which is 5 minutes.

If Transformer cannot access monitoring information within the specified retries, it assumes that the Spark job for the pipeline is not running and stops the pipeline.

When needed, you can configure these properties in the Transformer configuration properties of the deployment.

Note: These properties also define how Transformer tries to retrieve monitoring information for EMR pipelines.