Amazon

Amazon EMR Cluster Manager Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create an Amazon EMR Cluster Manager connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon EMR Cluster Manager connection properties, see Amazon Connection Properties.

After you create an Amazon EMR Cluster Manager connection, you can use the connection in the following location:
Engine Location
Transformer 4.0.0 or later
  • Pipeline configured to run on an Amazon EMR cluster

For information about features added to the connection with different engine releases, see the connection requirements for the engine.

Amazon EMR Serverless Connection

Available when using an authoring Data Collector version 5.3.0 or later.

To create an Amazon EMR Serverless connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon EMR Serverless connection properties, see Amazon Connection Properties.

After you create an Amazon EMR Serverless connection, you can use the connection in the following location:
Engine Location
Transformer 5.3.0 or later
  • Pipeline configured to run on an Amazon EMR Serverless application

For information about features added to the connection with different engine releases, see the connection requirements for the engine.

Amazon Kinesis Firehose Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create an Amazon Kinesis Firehose connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon Kinesis Firehose connection properties, see Amazon Connection Properties.

After you create an Amazon Kinesis Firehose connection, you can use the connection in the following stage:
Engine Stage
Data Collector 4.0.0 or later
  • Kinesis Firehose destination

Amazon Kinesis Streams Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create an Amazon Kinesis Streams connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon Kinesis Streams connection properties, see Amazon Connection Properties.

After you create an Amazon Kinesis Streams connection, you can use the connection in the following stages and locations:
Engine Stages and Locations
Data Collector 4.0.0 or later
  • Kinesis Consumer origin
  • Kinesis Producer destination
  • Write to Kinesis error record handling configured for a pipeline

Amazon Redshift Connection

Available when using an authoring Data Collector version 4.1.0 or later.

To create an Amazon Redshift connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon Redshift connection properties, see Amazon Redshift Properties.

After you create an Amazon Redshift connection, you can use the connection in the following stages and locations:
Engine Stages and Locations
Transformer 4.1.0 or later
  • Amazon Redshift origin
  • Amazon Redshift destination

Amazon S3 Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create an Amazon S3 connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon S3 connection properties, see Amazon Connection Properties.

After you create an Amazon S3 connection, you can use the connection in the following stages and locations:
Engine Stages and Locations

Data Collector 4.0.0 or later

  • Amazon S3 origin
  • Amazon S3 destination
  • Amazon S3 executor
  • Write to Amazon S3 error record handling configured for a pipeline

Transformer 4.0.0 or later

  • Amazon S3 origin
  • Amazon S3 destination

For information about features added to the connection with different engine releases, see the connection requirements for the engine.

Amazon SQS Connection

Available when using an authoring Data Collector version 4.0.0 or later.

To create an Amazon SQS connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.

For a description of the Amazon SQS connection properties, see Amazon Connection Properties.

After you create an Amazon SQS connection, you can use the connection in the following stage:
Engine Stage
Data Collector 4.0.0 or later
  • Amazon SQS Consumer origin

Amazon Security

You can configure Amazon connections to use one of the following authentication methods to connect securely to Amazon Web Services (AWS):

Instance profile
When Data Collector or Transformer runs on an Amazon EC2 instance that has an associated instance profile, the engine uses the instance profile credentials to automatically authenticate with AWS.
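The IAM policies attached to the instance profile must have permissions to perform all required tasks. For example, for Amazon Redshift connections, the policies must have permissions to read from or write to Amazon S3 and to the Redshift cluster, depending on how you use the connection.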
For more information about associating an instance profile with an EC2 instance, see the Amazon EC2 documentation.
AWS access keys
When the execution engine does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
The AWS access key pair must have permissions to perform all required tasks. For example, for Amazon Redshift connections, the AWS access key pair must have permissions to read from or write to Amazon S3 and to the Redshift cluster, depending on how you use the connection.
When using AWS access keys with Amazon Redshift connections, you must also install a JDBC driver. For more information, see the Transformer documentation.
Note: When configuring an Amazon EMR Cluster Manager, Amazon EMR Serverless, Amazon S3, or Amazon SQS connection, you can connect anonymously using no authentication.

Assume Another Role

When using instance profile or AWS access keys authentication, you can configure an Amazon connection to assume another IAM role.

For example, if the instance profile or the IAM user permissions do not grant access to write to Amazon S3 resources, you can configure an Amazon S3 connection used in an Amazon S3 destination to assume another role that does grant write access.
Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.

When an Amazon connection assumes a role, it temporarily gives up the instance profile or IAM user permissions and uses the permissions assigned to the assumed role. To assume a role, the connection calls the AWS STS AssumeRole API operation and passes the role to use. The operation creates a new session with the temporary credentials, as long as the following conditions are true:

  • The IAM policy attached to the current principal - the IAM role or user - grants permission to assume the specified role.
  • The IAM trust policy attached to the role to be assumed permits the current principal to assume it.

Assume Role Methods

You can configure an Amazon connection to assume a role in the following ways:
Assume a role with no restrictions

When configured to assume a role with no restrictions, any StreamSets user account that starts the pipeline can assume the role specified in the Amazon connection, as long as the IAM policies attached to the current principal and to the role to be assumed allow it.

For example, any StreamSets user who starts the job for the pipeline can assume the finance role when the IAM trust policy attached to the finance role allows the role to be assumed by the IAM role or user identified by the selected authentication method.

Assume a role using an external ID condition
When using an Amazon EMR Cluster Manager, Amazon EMR Serverless, or Amazon S3 connection, you can configure the connection to use an external ID condition when assuming a role.
When you use an external ID condition, any StreamSets user account that starts the pipeline can assume the role specified in the connection when the IAM policies attached to the current principal and to the role to be assumed allow it. However, the IAM policies attached to the role to be assumed must include an external ID condition, and the specified external ID must be defined in the connection.
When configured to use an external ID, the connection passes the following condition to the AWS STS AssumeRole API operation:
"Condition": {"StringEquals": {"sts:ExternalId": "<external id>"}}
Where the <external id> is a unique ID specified in the External ID property of the connection and is defined in the IAM trust policy attached to the role to be assumed.
AWS IAM verifies that the StreamSets user account that starts the pipeline can assume the role specified in the connection, and that the external ID specified in the connection matches the external ID in the IAM trust policy attached to the role to be assumed.
For example, any StreamSets user who starts the job for the pipeline can assume the finance role when both of the following are true:
  • The IAM trust policy attached to the finance role allows the role to be assumed by the IAM role or user identified by the selected authentication method.
  • The IAM trust policy attached to the finance role includes an external ID, such as finance-235A9df84iK, and the connection used in the pipeline includes the same external ID.
For more information about using an external ID, see the AWS documentation.
Tip: To provide additional security, use session tags with an external ID condition.
Assume a role using session tags to restrict role access
For increased security, you can configure a connection to assume a role and set session tags to restrict the user accounts allowed to assume the role. When configured to set session tags, the connection passes the following session tag to the AWS STS AssumeRole API operation:

streamsets/principal=<user>

Where <user> is the name of the currently logged in StreamSets user that starts the pipeline or job for the pipeline.

AWS IAM verifies that the user account set in the session tag can assume the specified role. The IAM trust policy attached to the role to be assumed must grant the current principal permission to assume the role and must use IAM condition keys to limit the AssumeRole action based on the requested session tags.

For example, when the StreamSets user Joe starts the job for the pipeline, he can assume the finance role when the IAM trust policy attached to the finance role allows the user joe to assume the role. The StreamSets user Emily cannot assume the finance role because the trust policy attached to the finance role does not allow the user emily to assume the role.
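
The external ID and session tag checks described above can be pictured with a small Python sketch. It is a simplified illustration, not how AWS IAM evaluates policies internally: the hypothetical condition_allows function handles only the StringEquals and Null operators that appear in the trust policies in this topic, and the user names are example values.

```python
def condition_allows(condition, request):
    """Return True if every check in a trust policy Condition block passes.

    `condition` is the "Condition" element of a trust policy statement.
    `request` maps condition keys (for example "sts:ExternalId") to the
    value sent with the AssumeRole call. Keys that were not sent are absent.
    Simplified: supports only the StringEquals and Null operators.
    """
    for operator, checks in condition.items():
        for key, expected in checks.items():
            actual = request.get(key)
            if operator == "StringEquals":
                allowed = expected if isinstance(expected, list) else [expected]
                if actual not in allowed:
                    return False
            elif operator == "Null":
                # "false" means the key must be present in the request.
                must_be_absent = str(expected).lower() == "true"
                if (actual is None) != must_be_absent:
                    return False
            else:
                raise ValueError(f"unsupported operator: {operator}")
    return True


TAG_KEY = "aws:RequestTag/streamsets/principal"

# Condition block of a hypothetical finance role that only joe may assume.
finance_condition = {
    "StringEquals": {TAG_KEY: ["joe@MyCompany"]},
    "Null": {TAG_KEY: "false"},
}

joe_allowed = condition_allows(finance_condition, {TAG_KEY: "joe@MyCompany"})
emily_allowed = condition_allows(finance_condition, {TAG_KEY: "emily@MyCompany"})
```

In this sketch, joe_allowed is True and emily_allowed is False. A request that carries no session tag at all is also rejected, because the Null check requires the tag to be present.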

To configure an Amazon connection to assume a role, you first must create the trust policy in AWS that allows the role to be assumed. Then, you configure the required connection properties in Control Hub.

Create the Trust Policy

In AWS, create and attach a trust policy to the role to be assumed. The policy must allow other principals - IAM roles or users - to assume the role.

Important: You must also attach a policy to the principal that grants permission to the principal to assume another role. AWS IAM provides several methods of granting a principal access to assume another role. For details, see the AWS IAM documentation.

The trust policy that you create for the role to be assumed depends on whether you want to allow connections to assume the role with or without restrictions:

Trust policy to assume the role with no restrictions
Create and attach a trust policy to the role to be assumed that allows another IAM role or user to assume the role.
For example, if using instance profile authentication, you might create the following policy where:
  • <account_id> is the ID of your AWS account.
  • <role_name> is the name of the role permitted to assume this role. Enter the name of the role included in the instance profile associated with the Amazon EC2 instance where the StreamSets engine runs.
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Sid": "",
     "Effect": "Allow",
     "Principal": {
       "AWS": "arn:aws:iam::<account_id>:role/<role_name>"
     },
     "Action": [
       "sts:AssumeRole"
     ]
   }
 ]
}
If using AWS access keys authentication, create a similar trust policy. However, for the principal, specify the ARN of the IAM user permitted to assume this role. Enter the name of the IAM user that owns the access keys used to authenticate with AWS. For example:
...
"Principal": {
       "AWS": "arn:aws:iam::<account_id>:user/<user_name>"
},
...
Trust policy to assume a role with an external ID condition
When using an Amazon S3, Amazon EMR Cluster Manager, or Amazon EMR Serverless connection, you can use an external ID condition to restrict access to a role.
Create and attach a trust policy to the role to be assumed that allows the IAM role or user to assume the role and that includes an external ID condition.
For example, if using instance profile authentication, you might create the following policy where:
  • <account_id> is the ID of your AWS account.
  • <role_name> is the name of the role permitted to assume this role. Enter the name of the role included in the instance profile associated with the Amazon EC2 instance where the StreamSets engine runs.
  • <external_id> is the external ID that must be included in the connection.
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Sid": "",
     "Effect": "Allow",
     "Principal": {
       "AWS": "arn:aws:iam::<account_id>:role/<role_name>"
     },
     "Action": [
       "sts:AssumeRole"
     ],
     "Condition": {
       "StringEquals": {
         "sts:ExternalId": "<external_id>"
        }
     }
   }
 ]
}
If using AWS access keys authentication, create a similar trust policy.
For more information about using an external ID, see the AWS documentation.
Trust policy to assume a role using session tags to restrict role access
Create and attach a trust policy to the role to be assumed that allows the IAM role or user to assume the role, uses session tags, and restricts the session tag values to specific StreamSets user accounts.
For example, if using instance profile authentication, you might create the following policy where:
  • <account_id> is the ID of your AWS account.
  • <role_name> is the name of the role permitted to assume this role. Enter the name of the role included in the instance profile associated with the Amazon EC2 instance where the StreamSets engine runs.
  • <user1> and <user2> are StreamSets user accounts allowed to assume this role. To specify a Control Hub user account, use the required naming convention: <user ID>@<organization ID>. For example, joe@MyCompany.
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Sid": "",
     "Effect": "Allow",
     "Principal": {
       "AWS": "arn:aws:iam::<account_id>:role/<role_name>"
     },
     "Action": [
       "sts:AssumeRole",
       "sts:TagSession"
     ],
     "Condition": {
       "StringEquals": {
         "aws:RequestTag/streamsets/principal": ["<user1>", "<user2>"]
       },
        "Null": {
          "aws:RequestTag/streamsets/principal": "false"
        }
     }
   }
 ]
}
If using AWS access keys authentication, create a similar trust policy. However, for the principal, specify the ARN of the IAM user permitted to assume this role. Enter the name of the IAM user that owns the access keys used to authenticate with AWS. For example:
...
"Principal": {
       "AWS": "arn:aws:iam::<account_id>:user/<user_name>"
},
...

For more information about creating an IAM trust policy, see the AWS IAM documentation.
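
If you maintain several of these trust policies, generating them from one helper can keep the three variants consistent. The following Python sketch is one possible approach that uses only the standard library. The build_trust_policy function, its parameter names, and the account ID 123456789012 are illustrative placeholders, not part of StreamSets or AWS tooling. The helper writes the IAM policy language version 2012-10-17.

```python
import json

TAG_KEY = "aws:RequestTag/streamsets/principal"


def build_trust_policy(principal_arn, external_id=None, allowed_users=None):
    """Build a trust policy document as a Python dict.

    principal_arn -- ARN of the IAM role or user permitted to assume the role.
    external_id   -- optional external ID to require (external ID condition).
    allowed_users -- optional list of StreamSets user accounts to allow
                     through the streamsets/principal session tag.
    """
    statement = {
        "Sid": "",
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": ["sts:AssumeRole"],
    }
    condition = {}
    if external_id is not None:
        condition["StringEquals"] = {"sts:ExternalId": external_id}
    if allowed_users:
        # Session tags also require the sts:TagSession action.
        statement["Action"].append("sts:TagSession")
        condition.setdefault("StringEquals", {})[TAG_KEY] = list(allowed_users)
        condition["Null"] = {TAG_KEY: "false"}
    if condition:
        statement["Condition"] = condition
    return {"Version": "2012-10-17", "Statement": [statement]}


# Example: session tag policy for a placeholder account and role.
policy = build_trust_policy(
    "arn:aws:iam::123456789012:role/engine-instance-role",
    allowed_users=["joe@MyCompany"],
)
policy_json = json.dumps(policy, indent=1)
```

Calling the helper with only principal_arn produces the unrestricted policy. Passing external_id or allowed_users adds the corresponding condition. Review the generated JSON before you attach it to a role in AWS.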

Configure Connections to Assume a Role

After you create and attach a trust policy to the role to be assumed, you can configure Amazon connections to assume the role.

  1. On the primary tab of the Amazon connection, select AWS Keys or Instance Profile for the Authentication Method property.
    Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
  2. Select Assume Role.
  3. Configure the following properties:
    Assume Role Property Description
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    External ID

    External ID included in an IAM trust policy that allows the specified role to be assumed.

    Available for Amazon S3, Amazon EMR Cluster Manager, and Amazon EMR Serverless connections.
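
To see how these properties relate to the AWS STS AssumeRole operation, the following Python sketch validates the values and assembles a request dictionary. The dictionary keys (RoleArn, RoleSessionName, DurationSeconds, ExternalId, Tags) are the AssumeRole request parameters. The mapping from connection properties to those parameters, the build_assume_role_request function, and the default timeout of 3,600 seconds are assumptions made for illustration and do not describe the engine's internal implementation.

```python
import re

# Role ARN format from this topic: arn:aws:iam::<account_id>:role/<role_name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::\d{12}:role/.+$")


def build_assume_role_request(role_arn, session_name, session_timeout=3600,
                              external_id=None, streamsets_user=None):
    """Validate the assume-role properties and build AssumeRole parameters."""
    if not ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"unexpected role ARN format: {role_arn}")
    # Session Timeout must be between 3,600 and 43,200 seconds.
    if not 3600 <= session_timeout <= 43200:
        raise ValueError("session timeout must be 3600 to 43200 seconds")
    request = {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": session_timeout,
    }
    if external_id:
        request["ExternalId"] = external_id
    if streamsets_user:
        # Corresponds to Set Session Tags: streamsets/principal=<user>
        request["Tags"] = [
            {"Key": "streamsets/principal", "Value": streamsets_user}
        ]
    return request


# Example with placeholder values.
example_request = build_assume_role_request(
    "arn:aws:iam::123456789012:role/finance",
    "streamsets-session",
    external_id="finance-235A9df84iK",
    streamsets_user="joe@MyCompany",
)
```

A dictionary like example_request could be passed to an STS client to test a trust policy outside of a pipeline, for example with the AWS SDK for Python.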

Amazon Connection Properties

You configure the following types of properties for Amazon connections:

Common Properties

Common properties are used by all Amazon connections, except the Amazon Redshift connection.

When creating all other Amazon connections, configure the following properties on the primary tab of the connection:
Common Property Description
Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
  • AWS Keys - Authenticates using an AWS access key pair.
  • Instance Profile - Authenticates using an instance profile associated with the Data Collector or Transformer EC2 instance.
  • None - Connects anonymously using no authentication. Available only for Amazon EMR Cluster Manager, Amazon EMR Serverless, Amazon S3, or Amazon SQS connections.
Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
Assume Role Temporarily assumes another role to authenticate with AWS.
Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
Role ARN

Amazon resource name (ARN) of the role to assume, entered in the following format:

arn:aws:iam::<account_id>:role/<role_name>

Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

Available when assuming another role.

Role Session Name

Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

Available when assuming another role.

Session Timeout

Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

Set to a value between 3,600 seconds and 43,200 seconds.

Available when assuming another role.

Set Session Tags

Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

When cleared, the connection does not set a session tag.

Available when assuming another role.

External ID

External ID included in an IAM trust policy that allows the specified role to be assumed.

Available for Amazon S3, Amazon EMR Cluster Manager, and Amazon EMR Serverless connections.

Available when assuming another role.

Use Specific Region Specify the AWS region or endpoint to connect to.

When cleared, the connection uses the Amazon S3 default global endpoint, s3.amazonaws.com.

Available only for Amazon S3 connections.

Region AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
Endpoint Endpoint to connect to when you select Other for the region. Enter the endpoint name.
Use Custom Endpoint Specifies the signing region to use when connecting to a custom endpoint.

When cleared, the connection uses the region specified in the endpoint.

Available only for Amazon S3 connections, when using Data Collector 4.4.0 or later.

Signing Region AWS region used by the custom endpoint.

Advanced Properties

You can optionally configure properties on the Advanced tab for the following connection types:
  • Amazon Kinesis Firehose
  • Amazon Kinesis Streams
  • Amazon S3
  • Amazon SQS
The defaults for the following Advanced properties should work in most cases:
Advanced Property Description
Connection Timeout Seconds to wait for a response before closing the connection.
Socket Timeout Seconds to wait for a response to a query.
Retry Count Maximum number of times to retry requests.
Use Proxy Specifies whether to use a proxy to connect.
Proxy Host Proxy host.
Proxy Port Proxy port.
Proxy User User name for proxy credentials.
Proxy Password Password for proxy credentials.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
Proxy Domain Optional domain name for the proxy server.
Proxy Workstation Optional workstation for the proxy server.

Amazon Redshift Properties

When creating an Amazon Redshift connection, configure the following property on the Redshift tab:
Redshift Property Description
Redshift Endpoint Amazon Redshift endpoint to use.
On the Credentials tab, configure the following properties:
Credential Property Description
Security Authentication method used to connect to Amazon Web Services (AWS):
  • AWS Keys - Authenticates using an AWS access key pair.

    When using AWS access keys, the AWS access key pair must have permissions to read from or write to Amazon S3 and to the Redshift cluster, depending on how you use the connection. You must also install a JDBC driver. For more information, see the Transformer documentation.

  • Instance Profile - Authenticates using an instance profile associated with the Data Collector or Transformer EC2 instance.

    When using an instance profile, the IAM policies attached to the instance profile must have permissions to read from or write to Amazon S3 and to the Redshift cluster, depending on how you use the connection.

Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
DB User Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
DB Password Password for the database user account.

Available when using Instance Profile security.

IAM Role for Unload to S3 ARN of the IAM role assigned to the Redshift cluster. Transformer uses the role to write to the specified S3 staging location. The role must have write permission for the S3 staging location.

Available when using Instance Profile security.

Auto-Create DB User Enables creating a database user to write data to Redshift.

Available when using AWS Keys security.

DB Groups Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the S3 staging location.

Available when using AWS Keys security.

EMR Cluster Manager Properties

When creating an Amazon EMR Cluster Manager connection, first configure the appropriate common properties. Then, continue configuring the following EMR cluster manager properties:
EMR Property Description
S3 Staging URI Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline.

The specified bucket and path must exist before you run the pipeline.

Use the following format:
s3://<bucket>/<path>
Define Cluster Start Option Defines how the pipeline accesses the cluster to run the pipeline:
  • Existing Cluster - Uses an existing cluster. Specify the cluster by cluster ID or name and tags.

    For more information about using an existing cluster, see the Transformer documentation.

  • Provision New Cluster - Uses a cluster provisioned by Transformer based on the specified properties.

    For more information about provisioning a cluster, see the Transformer documentation.

  • AWS Service Catalog - Uses a cluster provisioned by AWS Service Catalog based on an EMR cluster product template and specified properties. Provisioning a cluster with AWS Service Catalog requires completing

    For more information about provisioning a cluster with AWS Service Catalog, see the Transformer documentation.

Note: This property and provisioning a cluster with AWS Service Catalog are available with authoring Data Collector 5.10.0 and later. Use the connection with Transformer 5.7.0 or later.
Provision a New Cluster Transformer provisions a new cluster to run the pipeline. When this property is cleared, Transformer uses the specified existing cluster.
Tip: Terminating a provisioned cluster after the pipeline stops is a cost-effective method of running a Transformer pipeline.

Available with authoring Data Collector 5.9.x and earlier. Use the connection with Transformer 5.6.x or earlier.

For more information about running a pipeline on a provisioned cluster, see the Transformer documentation, version 5.6.x or earlier.

Cluster by Name and Tags Enables specifying a cluster by cluster name and tags instead of by cluster ID.

Available when using an existing cluster and when using an authoring Data Collector 5.4.0 or later. Use the connection with Transformer 5.4.0 or later.

For more information, see the Transformer documentation.

Cluster Name Name of the existing cluster to run the pipeline. This property is case-sensitive.

Available when specifying an existing cluster by name and tags.

Cluster Tags Tag name and values to use to differentiate between multiple clusters with the specified cluster name.

Click Add to define a tag. Click Add Another to define additional tags.

Available when specifying an existing cluster by name and tags.

Cluster ID ID of the existing cluster to run the pipeline.

For more information, see the Transformer documentation.

EMR Version EMR cluster version to provision. Transformer supports EMR version 5.20.0 and later 5.x versions.

Available for clusters provisioned by Transformer.

Cluster Name Prefix Prefix for the name of the provisioned EMR cluster.

Available for clusters provisioned by Transformer.

Terminate Cluster Terminates the provisioned cluster when the pipeline stops.

When cleared, the cluster remains active after the pipeline stops.

Available for clusters provisioned by Transformer.

Logging Enabled Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates.

Available for clusters provisioned by Transformer.

S3 Log URI Location in Amazon S3 to store pipeline log data.
Location must be unique for each pipeline. Use the following format:
s3://<bucket>/<path>

The bucket must exist before you start the pipeline.

Available when you enable logging for a cluster provisioned by Transformer.

Service Role EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks.

Default is EMR_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

Available for clusters provisioned by Transformer.

Job Flow Role EMR role for the EC2 instances within the cluster used to perform pipeline tasks.

Default is EMR_EC2_DefaultRole. For more information about configuring roles for Amazon EMR, see the Amazon EMR documentation.

Available for clusters provisioned by Transformer.

SSH EC2 Key ID SSH key used to access the EMR cluster nodes.

Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes.

For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation.

Available for clusters provisioned by Transformer.

Visible to All Users Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster.

Available for clusters provisioned by Transformer.

EC2 Subnet ID EC2 subnet identifier to launch the provisioned cluster in.

Available for clusters provisioned by Transformer.

Master Security Group ID of the security group on the master node in the cluster.
Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation.

Available for clusters provisioned by Transformer.

Slave Security Group Security group ID for the slave nodes in the cluster.

Available for clusters provisioned by Transformer.

Instance Count Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster.

Minimum is 2. Using an additional instance for each partition can improve pipeline performance.

Available for clusters provisioned by Transformer.

Master Instance Type EC2 instance type for the master node in the EMR cluster.

If an instance type does not display in the list, select Custom and then enter the instance type.

Available for clusters provisioned by Transformer.

Master Instance Type (Custom) Custom EC2 instance type for the master node.

Available when you select Custom for the Master Instance Type property.

Slave Instance Type EC2 instance type for the slave nodes in the EMR cluster.

If an instance type does not display in the list, select Custom and then enter the instance type.

Available for clusters provisioned by Transformer.

Slave Instance Type (Custom) Custom EC2 instance type for the slave nodes.

Available when you select Custom for the Slave Instance Type property.

AWS Cluster Tags AWS tags assigned to each EMR cluster provisioned. For each tag, specify a tag name or key and a tag value.

Available for clusters provisioned by Transformer.

Provisioned Product Name Name for the provisioned cluster.

Available for clusters provisioned by AWS Service Catalog when not generating the product name.

Generate Product Name Transformer generates the product name for the cluster to be provisioned by AWS Service Catalog.

Available for clusters provisioned by AWS Service Catalog.

Project ID Project ID for the cluster to be provisioned by AWS Service Catalog.

Available for clusters provisioned by AWS Service Catalog.

Version Name Version name for the cluster to be provisioned by AWS Service Catalog.

Available for clusters provisioned by AWS Service Catalog.

Parameters Optional parameter names and values to pass to AWS Service Catalog. They must correspond to parameters allowed by your product template. For more information, see the AWS documentation.

Available for clusters provisioned by AWS Service Catalog.

Terminate Provisioned Product Terminates the provisioned cluster and associated AWS Service Catalog product when the pipeline stops.

Available for clusters provisioned by AWS Service Catalog.

Max Retries Maximum number of times to retry a failed request or throttling error.
Retry Base Delay Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Throttling Retry Base Delay Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Max Backoff The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
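
The interaction between the retry properties above can be illustrated with a short sketch. It only reproduces the schedule described in the table: the base delay doubles for each subsequent retry and is capped at Max Backoff. The function name and the sample values are hypothetical, and the sketch assumes that the first retry waits exactly the base delay. The actual client behavior might differ, for example by adding random jitter.

```python
from typing import List


def retry_delays(base_delay_ms: int, max_backoff_ms: int, max_retries: int) -> List[int]:
    """Return the wait in milliseconds before each retry.

    Assumption: the first retry waits the base delay, and each later
    retry doubles the previous wait, capped at max_backoff_ms.
    """
    delays = []
    for attempt in range(max_retries):
        # Double the base delay once per previous retry, then apply the cap.
        delay = min(base_delay_ms * (2 ** attempt), max_backoff_ms)
        delays.append(delay)
    return delays


# Hypothetical property values, not product defaults.
print(retry_delays(base_delay_ms=100, max_backoff_ms=20000, max_retries=5))
# prints [100, 200, 400, 800, 1600]
```

Under these assumptions, a larger base delay, such as one used for throttling errors, reaches the Max Backoff cap after fewer retries, and all later retries wait the capped value.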

EMR Serverless Properties

When creating an Amazon EMR Serverless connection, first configure the appropriate common properties. Then, continue configuring the following EMR Serverless properties:
EMR Serverless Property Description
S3 Staging URI Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline.

The specified bucket and path must exist before you run the pipeline.

Use the following format:
s3://<bucket>/<path>
Create a New Application Creates a new EMR Serverless application to run the pipeline.
Application by Name and Tags Enables specifying an application by application name and tags instead of by application ID.

Available when not creating a new application and when using an authoring Data Collector version 5.4.0 or later. Use the connection with Transformer 5.4.0 or later.

For more information, see the Transformer documentation.

EMR Application Name Name of the existing application to run the pipeline. This property is case-sensitive.

Available when specifying an application by name and tags.

EMR Application Tags Tag names and values used to differentiate between multiple applications with the specified application name.

Click Add to define a tag. Click Add Another to define additional tags.

Available when specifying an application by name and tags.

Application ID ID of an existing EMR Serverless application to run the pipeline.

Available when not creating a new application and when not specifying an application by name and tags.

For more information, see the Transformer documentation.

Runtime Role ARN Identity and Access Management (IAM) role used by the job. The role must have access to the data sources, targets, scripts, and libraries that the job uses.

Enter the role in the following format:

arn:aws:iam::<account_id>:role/<role_name>

Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

EMR Version Amazon EMR version of the created application. Transformer supports EMR version 6.9.0 and later 6.x versions.

Available when creating a new application.

Application Name Prefix Prefix automatically added to the EMR Serverless application name.

A prefix can help you identify the created applications in the EMR Studio console.

Available when creating a new application.

Stop Application Stops the EMR Serverless application when the pipeline stops.

Available when creating a new application.

Subnet IDs IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline. Specify subnets in the virtual private cloud (VPC) where the EMR Serverless application resides.

Available when creating a new application.

Security Group IDs IDs of one or more security groups that can communicate with Transformer and the origin and destination systems configured in the pipeline.

Available when creating a new application.

Maximum CPU (vCPU) Maximum number of vCPUs that the application can scale to.

Available when creating a new application.

Maximum Memory (GB) Maximum memory, specified in GB, that the application can scale to.

Available when creating a new application.

Maximum Disk (GB) Maximum disk size, specified in GB, that the application can scale to.

Available when creating a new application.

Logging Enabled Copies log files from job runs to a specified Amazon S3 location. Select to provide access to log data after the application stops.
S3 Log URI Location in Amazon S3 to store pipeline log data.
The location must be unique for each pipeline. Use the following format:
s3://<bucket>/<path>

The bucket must exist before you start the pipeline.

Available when Logging Enabled is selected.

AWS Tags AWS tags assigned to all applications and job runs created. For each tag, specify a tag name or key and a tag value.
Max Retries Maximum number of times to retry a failed request or throttling error.
Retry Base Delay Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Throttling Retry Base Delay Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property.
Max Backoff The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors.
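
The S3 Staging URI, S3 Log URI, and Runtime Role ARN properties in the table above each expect a specific format. The following is a minimal sketch of how those values might be checked before they are entered in the connection. The function names and sample values are hypothetical, and the patterns are assumptions based only on the formats shown in this table: a non-empty bucket and path for S3 URIs, and a 12-digit AWS account ID in the role ARN. The sketch does not verify that the bucket, path, or role actually exists.

```python
import re
from typing import Tuple

# s3://<bucket>/<path> with a non-empty bucket and a non-empty path.
S3_URI_PATTERN = re.compile(r"^s3://(?P<bucket>[^/]+)/(?P<path>.+)$")

# arn:aws:iam::<account_id>:role/<role_name>
# Assumption: only the "aws" partition shown in the table is accepted.
ROLE_ARN_PATTERN = re.compile(
    r"^arn:aws:iam::(?P<account_id>\d{12}):role/(?P<role_name>.+)$"
)


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, path), or raise ValueError."""
    match = S3_URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"Expected s3://<bucket>/<path>, got: {uri!r}")
    return match.group("bucket"), match.group("path")


def parse_role_arn(arn: str) -> Tuple[str, str]:
    """Split a role ARN into (account_id, role_name), or raise ValueError."""
    match = ROLE_ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(
            f"Expected arn:aws:iam::<account_id>:role/<role_name>, got: {arn!r}"
        )
    return match.group("account_id"), match.group("role_name")


# Hypothetical sample values for illustration only.
print(parse_s3_uri("s3://example-bucket/transformer/staging"))
print(parse_role_arn("arn:aws:iam::123456789012:role/example-runtime-role"))
```

A check like this catches only formatting mistakes, such as a missing path or a malformed account ID. Permissions, the IAM trust policy, and the existence of the bucket still need to be verified in AWS.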