Amazon
Amazon EMR Cluster Manager Connection
Available when using an authoring Data Collector version 4.0.0 or later.
To create an Amazon EMR Cluster Manager connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon EMR Cluster Manager connection properties, see Amazon Connection Properties.
Engine | Location |
---|---|
Transformer 4.0.0 or later | Pipeline configured to run on an Amazon EMR cluster |
For information about features added to the connection with different engine releases, see the connection requirements for the engine.
Amazon EMR Serverless Connection
Available when using an authoring Data Collector version 5.3.0 or later.
To create an Amazon EMR Serverless connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon EMR Serverless connection properties, see Amazon Connection Properties.
Engine | Location |
---|---|
Transformer 5.3.0 or later | Pipeline configured to run on an Amazon EMR Serverless application |
For information about features added to the connection with different engine releases, see the connection requirements for the engine.
Amazon Kinesis Firehose Connection
Available when using an authoring Data Collector version 4.0.0 or later.
To create an Amazon Kinesis Firehose connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Kinesis Firehose connection properties, see Amazon Connection Properties.
Engine | Stage |
---|---|
Data Collector 4.0.0 or later | Kinesis Firehose destination |
Amazon Kinesis Streams Connection
Available when using an authoring Data Collector version 4.0.0 or later.
To create an Amazon Kinesis Streams connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Kinesis Streams connection properties, see Amazon Connection Properties.
Engine | Stages and Locations |
---|---|
Data Collector 4.0.0 or later |
|
Amazon Redshift Connection
Available when using an authoring Data Collector version 4.1.0 or later.
To create an Amazon Redshift connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Redshift connection properties, see Amazon Redshift Properties.
Engine | Stages and Locations |
---|---|
Transformer 4.1.0 or later |
|
Amazon S3 Connection
Available when using an authoring Data Collector version 4.0.0 or later.
To create an Amazon S3 connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon S3 connection properties, see Amazon Connection Properties.
Engine | Stages and Locations |
---|---|
Data Collector 4.0.0 or later |
|
Transformer 4.0.0 or later |
|
For information about features added to the connection with different engine releases, see the connection requirements for the engine.
Amazon SQS Connection
Available when using an authoring Data Collector version 4.0.0 or later.
To create an Amazon SQS connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon SQS connection properties, see Amazon Connection Properties.
Engine | Stage |
---|---|
Data Collector 4.0.0 or later | Amazon SQS Consumer origin |
Amazon Security
You can configure Amazon connections to use one of the following authentication methods to connect securely to Amazon Web Services (AWS):
- Instance profile
- When Data Collector or Transformer runs on an Amazon EC2 instance that has an associated instance profile, the engine uses the instance profile credentials to automatically authenticate with AWS.
- AWS access keys
- When the execution engine does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
Assume Another Role
When using instance profile or AWS access keys authentication, you can configure an Amazon connection to assume another IAM role.
When an Amazon connection assumes a role, it temporarily gives up the instance profile or IAM user permissions and uses the permissions assigned to the assumed role. To assume a role, the connection calls the AWS STS AssumeRole API operation and passes the role to use. The operation creates a new session with the temporary credentials, as long as the following conditions are true:
- The IAM policy attached to the current principal - the IAM role or user - grants permission to assume the specified role.
- The IAM trust policy attached to the role to be assumed permits the current principal to assume it.
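The two conditions can be pictured as a pair of IAM policy documents: one attached to the current principal and one attached to the role to be assumed. The following is a minimal sketch that builds both as Python dictionaries and prints them as JSON. The account ID, the principal name `streamsets-engine`, and the role name `finance` are hypothetical placeholders, not values required by the connection.

```python
import json

# Hypothetical placeholders for illustration only.
ACCOUNT_ID = "111122223333"
ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/finance"
PRINCIPAL_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/streamsets-engine"


def identity_policy(role_arn):
    """Policy attached to the current principal: grants permission to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": role_arn}
        ],
    }


def trust_policy(principal_arn):
    """Trust policy attached to the role to be assumed: names who may assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(identity_policy(ROLE_ARN), indent=2))
    print(json.dumps(trust_policy(PRINCIPAL_ARN), indent=2))
```

Both documents must be in place for the AssumeRole call to succeed. Your organization's policies may need additional statements or conditions.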
Assume Role Methods
- Assume a role with no restrictions
- When configured to assume a role with no restrictions, any StreamSets user account that starts the pipeline can assume the role specified in the Amazon connection, as long as the IAM policies attached to the current principal and to the role to be assumed allow it.
For example, any StreamSets user who starts the job for the pipeline can assume the finance role when the IAM trust policy attached to the finance role allows the role to be assumed by the IAM role or user identified by the selected authentication method.
- Assume a role using an external ID condition
- When using an Amazon EMR Cluster Manager, Amazon EMR Serverless, or Amazon S3 connection, you can configure the connection to use an external ID condition when assuming a role.
- Assume a role using session tags to restrict role access
- For increased security, you can configure a connection to assume a role and set session tags to restrict the user accounts allowed to assume the role. When configured to set session tags, the connection passes the following session tag to the AWS STS AssumeRole API operation:
streamsets/principal=<user>
Where <user> is the name of the currently logged in StreamSets user that starts the pipeline or job for the pipeline.
AWS IAM verifies that the user account set in the session tag can assume the specified role. The IAM trust policy attached to the role to be assumed must allow the current principal permission to assume the role and must have constraints using IAM condition keys to limit the AssumeRole action based on the requested session tags.
For example, when the StreamSets user Joe starts the job for the pipeline, he can assume the finance role when the IAM trust policy attached to the finance role allows the user joe to assume the role. The StreamSets user Emily cannot assume the finance role because the trust policy attached to the finance role does not allow the user emily to assume the role.
To configure an Amazon connection to assume a role, you first must create the trust policy in AWS that allows the role to be assumed. Then, you configure the required connection properties in Control Hub.
Create the Trust Policy
In AWS, create and attach a trust policy to the role to be assumed. The policy must allow other principals - IAM roles or users - to assume the role.
The trust policy that you create for the role to be assumed depends on whether you want to allow connections to assume the role with or without restrictions:
- Trust policy to assume the role with no restrictions
- Create and attach a trust policy to the role to be assumed that allows another IAM role or user to assume the role.
- Trust policy to assume a role with an external ID condition
- When using an Amazon S3, Amazon EMR Cluster Manager, or Amazon EMR Serverless connection, you can use an external ID condition to restrict access to a role.
- Trust policy to assume a role using session tags to restrict role access
- Create and attach a trust policy to the role to be assumed that allows the IAM role or user to assume the role, uses session tags, and restricts the session tag values to specific StreamSets user accounts.
For more information about creating an IAM trust policy, see the AWS IAM documentation.
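As a rough illustration of the three trust policy variants, the following sketch builds each one as a Python dictionary. It assumes the standard IAM condition keys `sts:ExternalId` and `aws:RequestTag/<key>`, and it assumes that the session tag key is `streamsets/principal`, as described for the session tag method. The principal ARN, the external ID value, and the user name are hypothetical, and the exact form of StreamSets user names in your organization may differ.

```python
import json


def trust_policy(principal_arn, external_id=None, allowed_users=None):
    """Build a trust policy document for the role to be assumed.

    external_id:   when given, adds an sts:ExternalId condition.
    allowed_users: when given, allows sts:TagSession and restricts the
                   streamsets/principal session tag to the listed users.
    """
    actions = ["sts:AssumeRole"]
    condition = {}
    if external_id is not None:
        condition["StringEquals"] = {"sts:ExternalId": external_id}
    if allowed_users:
        # Passing session tags requires permission for the sts:TagSession action.
        actions.append("sts:TagSession")
        condition["StringLike"] = {
            "aws:RequestTag/streamsets/principal": list(allowed_users)
        }
    statement = {
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": actions,
    }
    if condition:
        statement["Condition"] = condition
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Hypothetical principal used by the selected authentication method.
    principal = "arn:aws:iam::111122223333:role/streamsets-engine"
    variants = {
        "no restrictions": trust_policy(principal),
        "external ID": trust_policy(principal, external_id="example-external-id"),
        "session tags": trust_policy(principal, allowed_users=["joe"]),
    }
    for name, policy in variants.items():
        print(f"# {name}")
        print(json.dumps(policy, indent=2))
```

With the session tag variant, a pipeline started by a user that is not in the list fails the condition, so the AssumeRole request is denied.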
Configure Connections to Assume a Role
After you create and attach a trust policy to the role to be assumed, you can configure Amazon connections to assume the role.
- On the primary tab of the Amazon connection, select AWS Keys or Instance Profile for the Authentication Method property.
Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
- Select Assume Role.
- Configure the following properties:
Assume Role Property | Description |
---|---|
Role ARN | Amazon resource name (ARN) of the role to assume, entered in the following format: arn:aws:iam::<account_id>:role/<role_name> Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed. |
Role Session Name | Optional name for the session created by assuming a role. Overrides the default unique identifier for the session. |
Session Timeout | Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time. Set to a value between 3,600 seconds and 43,200 seconds. |
Set Session Tags | Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts. When cleared, the connection does not set a session tag. |
External ID | External ID included in an IAM trust policy that allows the specified role to be assumed. Available for Amazon S3, Amazon EMR Cluster Manager, and Amazon EMR Serverless connections. |
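Before entering these values, it can help to check them against the documented format and range. The following is a minimal sketch of such a check, not part of the product. The function names are hypothetical, the pattern covers only the arn:aws:iam::<account_id>:role/<role_name> format shown above, and it assumes a 12-digit AWS account ID.

```python
import re

# Matches only the documented format: arn:aws:iam::<account_id>:role/<role_name>
ROLE_ARN_PATTERN = re.compile(r"^arn:aws:iam::(\d{12}):role/([A-Za-z0-9_+=,.@/-]+)$")

# Documented range for the Session Timeout property, in seconds.
MIN_SESSION_SECONDS = 3600
MAX_SESSION_SECONDS = 43200


def parse_role_arn(arn):
    """Return (account_id, role_name) or raise ValueError for an unexpected format."""
    match = ROLE_ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a role ARN in the expected format: {arn!r}")
    return match.group(1), match.group(2)


def check_session_timeout(seconds):
    """Return the value if it falls in the documented range, else raise ValueError."""
    if not MIN_SESSION_SECONDS <= seconds <= MAX_SESSION_SECONDS:
        raise ValueError(
            f"session timeout must be between {MIN_SESSION_SECONDS} "
            f"and {MAX_SESSION_SECONDS} seconds, got {seconds}"
        )
    return seconds


if __name__ == "__main__":
    # Hypothetical account ID and role name.
    print(parse_role_arn("arn:aws:iam::111122223333:role/finance"))
    print(check_session_timeout(3600))
```

A timeout such as 60 seconds is rejected by this check because it falls below the 3,600 second minimum.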
Amazon Connection Properties
- Common properties - Used by all Amazon connections except Amazon Redshift.
- Advanced properties - Used by Amazon Kinesis Firehose, Kinesis Streams, S3, and SQS connections.
- Amazon Redshift properties - Used by Amazon Redshift connections.
- EMR cluster manager properties - Used by Amazon EMR Cluster Manager connections.
- EMR Serverless properties - Used by Amazon EMR Serverless connections.
Common Properties
Common properties are used by all Amazon connections, except the Amazon Redshift connection.
Common Property | Description |
---|---|
Authentication Method | Authentication method used to connect to Amazon Web Services (AWS), such as AWS Keys or Instance Profile. |
Access Key ID | AWS access key ID. Required when using AWS keys to authenticate with AWS. |
Secret Access Key | AWS secret access key. Required when using AWS keys to authenticate with AWS. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
Assume Role | Temporarily assumes another role to authenticate with AWS. Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements. |
Role ARN | Amazon resource name (ARN) of the role to assume, entered in the following format: arn:aws:iam::<account_id>:role/<role_name> Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. Available when assuming another role. |
Role Session Name | Optional name for the session created by assuming a role. Overrides the default unique identifier for the session. Available when assuming another role. |
Session Timeout | Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time. Set to a value between 3,600 seconds and 43,200 seconds. Available when assuming another role. |
Set Session Tags | Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts. When cleared, the connection does not set a session tag. Available when assuming another role. |
External ID | External ID included in an IAM trust policy that allows the specified role to be assumed. Available for Amazon S3, Amazon EMR Cluster Manager, and Amazon EMR Serverless connections. Available when assuming another role. |
Use Specific Region | Specify the AWS region or endpoint to connect to. When cleared, the connection uses the Amazon S3 default global endpoint, s3.amazonaws.com. Available only for Amazon S3 connections. |
Region | AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other. |
Endpoint | Endpoint to connect to when you select Other for the region. Enter the endpoint name. |
Use Custom Endpoint | Specify a signing region when connecting to a custom endpoint. When cleared, the connection uses the region specified in the endpoint. Available only for Amazon S3 connections, when using Data Collector 4.4.0 or later. |
Signing Region | AWS region used by the custom endpoint. |
Advanced Properties
- Amazon Kinesis Firehose
- Amazon Kinesis Streams
- Amazon S3
- Amazon SQS
Advanced Property | Description |
---|---|
Connection Timeout | Seconds to wait for a response before closing the connection. |
Socket Timeout | Seconds to wait for a response to a query. |
Retry Count | Maximum number of times to retry requests. |
Use Proxy | Specifies whether to use a proxy to connect. |
Proxy Host | Proxy host. |
Proxy Port | Proxy port. |
Proxy User | User name for proxy credentials. |
Proxy Password | Password for proxy credentials. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
Proxy Domain | Optional domain name for the proxy server. |
Proxy Workstation | Optional workstation for the proxy server. |
Amazon Redshift Properties
Redshift Property | Description |
---|---|
Redshift Endpoint | Amazon Redshift endpoint to use. |
Credential Property | Description |
---|---|
Security | Authentication method used to connect to Amazon Web Services (AWS), such as AWS Keys or Instance Profile. |
Access Key ID | AWS access key ID. Required when using AWS keys to authenticate with AWS. |
Secret Access Key | AWS secret access key. Required when using AWS keys to authenticate with AWS. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
DB User | Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table. |
DB Password | Password for the database user account. Available when using Instance Profile security. |
IAM Role for Unload to S3 | ARN of the IAM role assigned to the Redshift cluster. Transformer uses the role to write to the specified S3 staging location. The role must have write permission for the S3 staging location. Available when using Instance Profile security. |
Auto-Create DB User | Enables creating a database user to write data to Redshift. Available when using AWS Keys security. |
DB Groups | Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the S3 staging location. Available when using AWS Keys security. |
EMR Cluster Manager Properties
EMR Property | Description |
---|---|
S3 Staging URI | Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline. Use the following format: |
Define Cluster Start Option | Defines how the pipeline accesses the cluster to run the pipeline: Note: This property and provisioning a cluster with AWS Service Catalog are available with authoring Data Collector 5.10.0 and later. Use the connection with Transformer 5.7.0 or later. |
Provision a New Cluster | Transformer provisions a new cluster to run the pipeline. When this property is cleared, Transformer uses the specified existing cluster. Tip: Terminating a provisioned cluster after the pipeline stops is a cost-effective method of running a Transformer pipeline. Available with authoring Data Collector 5.9.x and earlier. Use the connection with Transformer 5.6.x or earlier. For more information about running a pipeline on a provisioned cluster, see the Transformer documentation, version 5.6.x or earlier. |
Cluster by Name and Tags | Enables specifying a cluster by cluster name and tags instead of by cluster ID. Available when using an existing cluster and when using an authoring Data Collector 5.4.0 or later. Use the connection with Transformer 5.4.0 or later. For more information, see the Transformer documentation. |
Cluster Name | Name of the existing cluster to run the pipeline. This property is case-sensitive. Available when specifying an existing cluster by name and tags. |
Cluster Tags | Tag name and values to use to differentiate between multiple clusters with the specified cluster name. Click Add to define a tag. Click Add Another to define additional tags. Available when specifying an existing cluster by name and tags. |
Cluster ID | ID of the existing cluster to run the pipeline. For more information, see the Transformer documentation. |
EMR Version | EMR cluster version to provision. Available for clusters provisioned by Transformer. |
Cluster Name Prefix | Prefix for the name of the provisioned EMR cluster. Available for clusters provisioned by Transformer. |
Terminate Cluster | Terminates the provisioned cluster when the pipeline stops. When cleared, the cluster remains active after the pipeline stops. Available for clusters provisioned by Transformer. |
Logging Enabled | Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates. Available for clusters provisioned by Transformer. |
S3 Log URI | Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: The bucket must exist before you start the pipeline. Available when you enable logging for a cluster provisioned by Transformer. |
Service Role | EMR role used by the Transformer EC2 instance to provision resources
and perform other service-level tasks. Default is
Available for clusters provisioned by Transformer. |
Job Flow Role | EMR role for the EC2 instances within the cluster used to perform
pipeline tasks. Default is Available for clusters provisioned by Transformer. |
SSH EC2 Key ID | SSH key used to access the EMR cluster nodes. Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes. For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation. Available for clusters provisioned by Transformer. |
Visible to All Users | Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster. Available for clusters provisioned by Transformer. |
EC2 Subnet ID | EC2 subnet identifier to launch the provisioned cluster in. Available for clusters provisioned by Transformer. |
Primary Security Group | ID of the security group on the primary node in the cluster. Note: Verify that the primary security group allows Transformer to access the primary node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation. Available for clusters provisioned by Transformer. |
Secondary Security Group | Security group ID for the secondary nodes in the cluster. Available for clusters provisioned by Transformer. |
Instance Count | Number of EC2 instances to use. Each instance corresponds to a secondary node in the EMR cluster. Minimum is 2. Using an additional instance for each partition can improve pipeline performance. Available for clusters provisioned by Transformer. |
Primary Instance Type | EC2 instance type for the primary node in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type. Available for clusters provisioned by Transformer. |
Primary Instance Type (Custom) | Custom EC2 instance type for the primary node. Available when you select Custom for the Primary Instance Type property. |
Secondary Instance Type | EC2 instance type for the secondary nodes in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type. Available for clusters provisioned by Transformer. |
Secondary Instance Type (Custom) | Custom EC2 instance type for the secondary nodes. Available when you select Custom for the Secondary Instance Type property. |
AWS Cluster Tags | AWS tags assigned to each EMR cluster provisioned. For each tag, specify a tag name or key and a tag value. Available for clusters provisioned by Transformer. |
Provisioned Product Name | Name for the provisioned cluster. Available for clusters provisioned by AWS Service Catalog when not generating the product name. |
Generate Product Name | Transformer generates the product name for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog. |
Project ID | Project ID for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog. |
Version Name | Version name for the cluster to be provisioned by AWS Service Catalog. Available for clusters provisioned by AWS Service Catalog. |
Parameters | Optional parameter names and values to pass to AWS Service Catalog. They must correspond to parameters allowed by your product template. For more information, see the AWS documentation. Available for clusters provisioned by AWS Service Catalog. |
Terminate Provisioned Product | Terminates the provisioned cluster and associated AWS Service Catalog product when the pipeline stops. Available for clusters provisioned by AWS Service Catalog. |
Max Retries | Maximum number of times to retry a failed request or throttling error. |
Retry Base Delay | Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Throttling Retry Base Delay | Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Max Backoff | The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors. |
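To see how the retry properties interact, the following sketch computes the wait before each retry when the base delay doubles on every attempt and is capped at the maximum backoff. It only illustrates the arithmetic described in the table. The function name and the sample values are hypothetical, and the engine's actual retry logic may differ, for example by adding random jitter.

```python
def retry_delays(base_delay_ms, max_backoff_ms, max_retries):
    """Return the wait before each retry, in milliseconds.

    The delay starts at base_delay_ms, doubles for each subsequent retry,
    and never exceeds max_backoff_ms.
    """
    delays = []
    delay = base_delay_ms
    for _ in range(max_retries):
        delays.append(min(delay, max_backoff_ms))
        delay *= 2
    return delays


if __name__ == "__main__":
    # Hypothetical values: 100 ms base delay, 1,000 ms max backoff, 6 retries.
    print(retry_delays(100, 1000, 6))  # [100, 200, 400, 800, 1000, 1000]
```

The same calculation applies to Throttling Retry Base Delay, which sets a separate starting delay for throttling errors.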
EMR Serverless Properties
EMR Serverless Property | Description |
---|---|
S3 Staging URI | Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline. Use the following format: |
Create a New Application | Creates a new EMR Serverless application to run the pipeline. |
Application by Name and Tags | Enables specifying an application by application name and tags instead of by application ID. Available when not creating a new application and when using an authoring Data Collector version 5.4.0 or later. Use the connection with Transformer 5.4.0 or later. For more information, see the Transformer documentation. |
EMR Application Name | Name of the existing application to run the pipeline. This property is case-sensitive. Available when specifying an application by name and tags. |
EMR Application Tags | Tag name and values to use to differentiate between multiple applications with the specified application name. Click Add to define a tag. Click Add Another to define additional tags. Available when specifying an application by name and tags. |
Application ID | ID of an existing EMR Serverless application to run the pipeline. Available when not creating a new application and when not specifying an application by name and tags. For more information, see the Transformer documentation. |
Runtime Role ARN | Identity and Access Management (IAM) role used by the job. The
role must have access to the data sources, targets, scripts, and
libraries that the job uses. Enter the role in the following format:
Where |
EMR Version | Amazon EMR version of created application. Available when creating a new application. |
Application Name Prefix | Prefix automatically added to the EMR Serverless application name. A prefix can help you identify the created applications in the EMR Studio console. Available when creating a new application. |
Stop Application | Stops the EMR Serverless application when the pipeline stops. Available when creating a new application. |
Subnet IDs | IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline. Specify subnets in the virtual private cloud (VPC) where the EMR Serverless application resides. Available when creating a new application. |
Security Group IDs | ID of one or more security groups that can communicate with Transformer and the origin and destination systems configured in the pipeline. Available when creating a new application. |
Maximum CPU (vCPU) | Maximum number of vCPUs that the application can scale to. Available when creating a new application. |
Maximum Memory (GB) | Maximum memory, specified in GB, that the application can scale to. Available when creating a new application. |
Maximum Disk (GB) | Maximum disk size, specified in GB, that the application can scale to. Available when creating a new application. |
Logging Enabled | Copies log files from job runs to a specified Amazon S3 location. Select to provide access to log data after the application stops. |
S3 Log URI | Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: The bucket must exist before you start the pipeline. Available when Logging Enabled is selected. |
AWS Tags | AWS tags assigned to all applications and job runs created. For each tag, specify a tag name or key and a tag value. |
Max Retries | Maximum number of times to retry a failed request or throttling error. |
Retry Base Delay | Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Throttling Retry Base Delay | Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Max Backoff | The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors. |