Amazon EMR Serverless
You can run Transformer pipelines on an Amazon EMR Serverless application. An EMR Serverless application uses a framework based on a version of Amazon EMR and a Spark runtime application. In Transformer, you configure an Amazon EMR Serverless application as a cluster manager.
Pipelines can use an existing EMR Serverless application or create a new one. Creating an application that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline. Running multiple pipelines on a single existing application can also reduce costs. Each application uses a specific version of Amazon EMR. For the supported versions, see Cluster Compatibility Matrix.
EMR Serverless applications run in a virtual private cloud (VPC). To use an EMR Serverless application, you must install Transformer in that VPC or in an accessible location. The application must also be able to access the origin and destination systems configured in the pipeline. To configure pipelines that create new applications, you need a VPC that can run the applications. You must specify subnet IDs from the VPC and security group IDs needed for communication.
Each pipeline has a staging location to store needed files. Pipelines started from the same Transformer instance can share a staging location, allowing Transformer to reuse the common files stored in that location.
- Authentication methods
- Information about an existing EMR Serverless application or details to define a new one
- Amazon S3 bucket and path for storage
- Path to the staging location
- Retry parameters for failed attempts
You can also use a connection to configure the pipeline.
The following image shows part of the Cluster tab of a pipeline configured to run on an EMR Serverless application:
VPC Access
Each EMR Serverless application resides in a virtual private cloud (VPC). You must install Transformer in that VPC or in a location that the VPC can access. Similarly, the origin and destination systems configured in the pipeline must be accessible from the VPC.
For pipelines that create new EMR Serverless applications, you need a VPC for the applications. You can use an existing VPC or create a new one. In the pipeline configuration, you specify the VPC subnet IDs that contain Transformer and the origin and destination systems configured in the pipeline. You also specify the security group IDs needed to communicate with Transformer and the origin and destination systems.
For pipelines that use an existing EMR Serverless application, you might need to update the application to select the subnets and security groups needed to access Transformer and the origin and destination systems configured in the pipeline.
StreamSets recommends installing Transformer on an Amazon EC2 instance launched into your VPC.
For information about VPCs, see the Amazon VPC documentation. For information about configuring VPC access for an EMR Serverless application, see the Amazon EMR Serverless documentation.
Authentication Method
You configure the authentication method that Transformer uses to connect to an EMR Serverless application. When you start the pipeline, Transformer uses the specified instance profile or AWS access key to launch the Spark application.
- Instance profile
- When Transformer runs on an Amazon EC2 instance that has an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS.
- AWS access keys
- When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
- None
- You can optionally choose to connect anonymously using no authentication.
You can also temporarily assume a specified role to connect to the Amazon EMR Serverless application. For more information, see Assume Another Role.
Existing Application
You can configure a pipeline to run on an existing EMR Serverless application.
To configure the pipeline, click the Cluster tab on the pipeline properties, clear the Create a New Application property, and then specify the application to use. You can specify an application by application ID or by application name and optional tags.
Transformer and the origin and destination systems configured in the pipeline must be accessible from the VPC where the specified application resides. You might need to update the application to select the subnets and security groups needed to access Transformer and the origin and destination systems. For information about configuring VPC access for an EMR Serverless application, see the Amazon EMR Serverless documentation.
You can configure all pipelines that run on an existing application to use the same staging location. With this configuration, EMR Serverless can reuse the common files stored in that location.
For information about interacting with an EMR Serverless application, see the Amazon EMR Serverless documentation.
Specifying an Application
When you configure pipeline properties for an existing EMR Serverless application, you specify the application to use by application ID, by default. If you prefer, you can specify the application using the application name and optional tags.
The cluster name property is case-sensitive. If your EMR account has access to more than one application by the same name, define tags to differentiate between the applications.
To specify the application by ID, on the Cluster tab, enter the application ID in the Application ID property.
To specify the application by name and tags, on the Cluster tab, enable the Application by Name and Tags property. Then, specify a case-sensitive application name in the EMR Application Name property.
To add optional tags, in the EMR Application Tags property, click Add, then define a tag name and value for each tag that you want to define.
Newly Created Application
You can configure a pipeline to run on a newly created EMR Serverless application. When configured, Transformer creates the application upon the initial run of a pipeline. You can configure Transformer to terminate the application after the pipeline stops.
Before you configure a pipeline to run on a new application, you need a VPC for the applications. You can use an existing VPC or create a new one. The pipeline configuration requires you to specify subnet and security group IDs from the VPC. For information about VPCs, see the Amazon VPC documentation.
To configure a pipeline to run on a new application, click the Cluster tab on the pipeline properties, select the Create a New Application property, and configure the application properties. You specify details, such as the runtime role, the EMR version, IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline, IDs of the security groups that can access Transformer and the origin and destination systems configured in the pipeline, and the maximum size of the application. You also indicate whether to stop the application after the pipeline stops.
You can also configure the pipeline to save log data to a different location to avoid losing that data when the application terminates.
For a full list of properties, see Configuring a Pipeline.
For more information about applications, see the Amazon EMR Serverless documentation.
Staging Location
Pipelines that run on an EMR Serverless application store files in an intermediate Amazon S3 location.
<S3 staging URI>/<staging directory>
The location must exist before you start the pipeline.
Pipelines that run from the same Transformer instance can use the same staging location. With the same staging location configured, Transformer can reuse the common files stored in that location. However, pipelines started by different Transformer instances must use different staging locations.
- Files that can be reused across pipelines
- Transformer stores files that can be reused across pipelines, including Transformer libraries and external resources such as JDBC drivers, in the following location:
- Files specific to each pipeline
- Transformer stores files specific to each pipeline, such as the pipeline JSON file and resource files used by the pipeline, in the following directory:
Monitoring EMR Serverless Pipelines
transformer.emr.monitoring.max.retry
- Defines how many times Transformer retries retrieving monitoring information. Default is 3.transformer.emr.monitoring.retry.base.backoff
- Defines the minimum number of seconds to wait between retries. Default is 15.transformer.emr.monitoring.retry.max.backoff
- Defines the maximum number of seconds to wait between retries. Default is 300, which is 5 minutes.
If Transformer cannot access monitoring information within the specified retries, it assumes that the Spark job for the pipeline is not running and stops the pipeline.
When needed, you can configure these properties in the Transformer configuration file,
$TRANSFORMER_CONF/transformer.properties
.