Amazon
Amazon EMR Cluster Manager Connection
Available when using an authoring Data Collector version 3.19.0 or later.
To create an Amazon EMR Cluster Manager connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon EMR Cluster Manager connection properties, see Amazon Connection Properties.
Engine | Location |
---|---|
Transformer 3.16.0 or later | Pipeline configured to run on an Amazon EMR cluster |
Amazon EMR Serverless Connection
Available when using an authoring Data Collector version 5.3.0 or later.
To create an Amazon EMR Serverless connection, the EMR with Hadoop stage library, streamsets-datacollector-emr_hadoop_<version>-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon EMR Serverless connection properties, see Amazon Connection Properties.
Engine | Location |
---|---|
Transformer 5.3.0 or later | Pipeline configured to run on an Amazon EMR Serverless application |
Amazon Kinesis Firehose Connection
Available when using an authoring Data Collector version 3.19.0 or later.
To create an Amazon Kinesis Firehose connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Kinesis Firehose connection properties, see Amazon Connection Properties.
Engine | Stage |
---|---|
Data Collector 3.19.0 or later | Kinesis Firehose destination |
Amazon Kinesis Streams Connection
Available when using an authoring Data Collector version 3.19.0 or later.
To create an Amazon Kinesis Streams connection, the Amazon Kinesis stage library, streamsets-datacollector-kinesis-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Kinesis Streams connection properties, see Amazon Connection Properties.
Engine | Stages and Locations |
---|---|
Data Collector 3.19.0 or later |
|
Amazon Redshift Connection
Available when using an authoring Data Collector version 4.1.0 or later.
To create an Amazon Redshift connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon Redshift connection properties, see Amazon Redshift Properties.
Engine | Stages and Locations |
---|---|
Transformer 4.1.0 or later |
|
Amazon S3 Connection
Available when using an authoring Data Collector version 3.19.0 or later.
To create an Amazon S3 connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon S3 connection properties, see Amazon Connection Properties.
Engine | Stages and Locations |
---|---|
Data Collector 3.19.0 or later |
|
Transformer 3.16.0 or later |
|
Amazon SQS Connection
Available when using an authoring Data Collector version 3.19.0 or later.
To create an Amazon SQS connection, the Amazon Web Services stage library, streamsets-datacollector-aws-lib, must be installed on the selected authoring Data Collector.
For a description of the Amazon SQS connection properties, see Amazon Connection Properties.
Engine | Stage |
---|---|
Data Collector 3.19.0 or later | Amazon SQS Consumer origin |
Amazon Security
You can configure Amazon connections to use one of the following authentication methods to connect securely to Amazon Web Services (AWS):
- Instance profile
- When the execution engine - Data Collector or Transformer - runs on an Amazon EC2 instance that has an associated instance profile, the engine uses the instance profile credentials to automatically authenticate with AWS.
- AWS access keys
- When the execution engine does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
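The two methods follow the standard AWS SDK credential patterns. As a rough illustration only, not how the engine is implemented internally, here is how the same choice looks in boto3, the AWS SDK for Python; the key values are placeholders:
```python
import boto3

# Instance profile: when the engine runs on an EC2 instance with an
# associated instance profile, no credentials are passed explicitly;
# the SDK can resolve them from the instance metadata service.
s3_instance_profile = boto3.client("s3")

# AWS access keys: the access key ID and secret access key are
# supplied directly (placeholder values shown here).
s3_access_keys = boto3.client(
    "s3",
    aws_access_key_id="AKIAEXAMPLEKEYID",
    aws_secret_access_key="exampleSecretAccessKey",
)
```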
Assume Another Role
When using instance profile or AWS access keys authentication, you can configure an Amazon connection to assume another IAM role.
When an Amazon connection assumes a role, it temporarily gives up the instance profile or IAM user permissions and uses the permissions assigned to the assumed role. To assume a role, the connection calls the AWS STS AssumeRole API operation and passes the role to use. The operation creates a new session with the temporary credentials, as long as the following conditions are true:
- The IAM policy attached to the current principal - the IAM role or user - grants permission to assume the specified role.
- The IAM trust policy attached to the role to be assumed permits the current principal to assume it.
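As an informal illustration of what assuming a role means at the API level, here is a boto3 sketch with placeholder account, role, and session names; it is not the connection's actual implementation:
```python
import boto3

sts = boto3.client("sts")

# Ask STS to assume the target role; the response carries temporary
# credentials scoped to the assumed role's permissions.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/finance",  # placeholder account and role
    RoleSessionName="streamsets-pipeline",             # placeholder session name
)

# Subsequent requests use the temporary credentials, not the
# original principal's permissions.
credentials = response["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
```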
- Assume a role with no restrictions
- When configured to assume a role with no restrictions, any StreamSets user account that starts the pipeline can assume the role specified in the Amazon connection, as long as the IAM policies attached to the current principal and to the role to be assumed allow it.
For example, any Control Hub user who starts the job for the pipeline can assume the finance role when the IAM trust policy attached to the finance role allows the role to be assumed by the IAM role or user identified by the selected authentication method.
- Assume a role using session tags to restrict role access
- For increased security, you can configure a connection to assume a role and set session tags to restrict the user accounts allowed to assume the role. When configured to set session tags, the connection passes the following session tag to the AWS STS AssumeRole API operation:
streamsets/principal=<user>
Where <user> is the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. The IAM trust policy attached to the role to be assumed must allow the current principal permission to assume the role and must have constraints using IAM condition keys to limit the AssumeRole action based on the requested session tags.
For example, when the Control Hub user Joe starts the job for the pipeline, he can assume the finance role when the IAM trust policy attached to the finance role allows the user joe to assume the role. The Control Hub user Emily cannot assume the finance role because the trust policy attached to the finance role does not allow the user emily to assume the role.
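In boto3 terms, and continuing the sketch above, the session tag travels on the AssumeRole request roughly as follows; the user name joe is the example value from the text, not a real account:
```python
# Same AssumeRole call as before, but with the session tag that the
# connection sets when Set Session Tags is selected.
response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/finance",
    RoleSessionName="streamsets-pipeline",
    Tags=[{"Key": "streamsets/principal", "Value": "joe"}],
)
```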
To configure an Amazon connection to assume a role, you first must create the trust policy in AWS that allows the role to be assumed. Then, you configure the required connection properties in Control Hub.
Create the Trust Policy
In AWS, create and attach a trust policy to the role to be assumed. The policy must allow other principals - IAM roles or users - to assume the role.
The trust policy that you create for the role to be assumed depends on whether you want to allow connections to assume the role with or without restrictions:
- Trust policy to assume the role with no restrictions
- Create and attach a trust policy to the role to be assumed that allows another IAM role or user to assume the role.
- Trust policy to assume a role using session tags to restrict role access
- Create and attach a trust policy to the role to be assumed that allows the IAM role or user to assume the role, uses session tags, and restricts the session tag values to specific StreamSets user accounts.
For more information about creating an IAM trust policy, see the AWS IAM documentation.
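For orientation only, a trust policy of the second kind might look like the following sketch, expressed as a Python dict and attached with boto3. Every account ID, role name, and the user joe are placeholders taken from the example above; model your actual policy on the AWS IAM documentation.
```python
import json
import boto3

# Example trust policy for the role to be assumed, restricted by session tags.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            # The principal the engine authenticates as (instance profile role
            # or IAM user) must be allowed to assume the role and tag the session.
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/engine-instance-role"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
            # Only requests that carry the streamsets/principal session tag
            # with an allowed value can assume the role.
            "Condition": {
                "StringEquals": {"aws:RequestTag/streamsets/principal": ["joe"]}
            },
        }
    ],
}
# For a trust policy with no restrictions, omit the sts:TagSession action
# and the Condition block.

iam = boto3.client("iam")
iam.update_assume_role_policy(RoleName="finance", PolicyDocument=json.dumps(trust_policy))
```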
Configure Connections to Assume a Role
After you create and attach a trust policy to the role to be assumed, you can configure Amazon connections to assume the role.
- On the primary tab of the Amazon connection, select AWS Keys or Instance Profile for the Authentication Method property.
Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer version 3.18.x or later supports assuming another role when the pipeline meets the stage library and cluster type requirements. Data Collector version 3.19.x supports assuming another role only with instance profile authentication.
- Select Assume Role.
- Configure the following properties:
Assume Role Property | Description |
---|---|
Role ARN | Amazon resource name (ARN) of the role to assume, entered in the following format: arn:aws:iam::<account_id>:role/<role_name> Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed. |
Role Session Name | Optional name for the session created by assuming a role. Overrides the default unique identifier for the session. |
Session Timeout | Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time. Set to a value between 3,600 seconds and 43,200 seconds. |
Set Session Tags | Sets a session tag to record the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts. When cleared, the connection does not set a session tag. |
Amazon Connection Properties
- Common properties - Used by all Amazon connections except Amazon Redshift.
- Advanced properties - Used by Amazon Kinesis Firehose, Kinesis Streams, S3, and SQS connections.
- Amazon Redshift properties - Used by Amazon Redshift connections.
- EMR cluster manager properties - Used by Amazon EMR Cluster Manager connections.
- EMR Serverless properties - Used by Amazon EMR Serverless connections.
Common Properties
Common properties are used by all Amazon connections, except the Amazon Redshift connection.
Common Property | Description |
---|---|
Authentication Method | Authentication method used to connect to Amazon Web Services (AWS): instance profile or AWS access keys. |
Access Key ID | AWS access key ID. Required when using AWS keys to authenticate with AWS. |
Secret Access Key | AWS secret access key. Required when using AWS keys to authenticate with AWS. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
Assume Role | Temporarily assumes another role to authenticate with AWS. Note: Assuming another role is not available for Amazon Redshift connections. For other connection types, Transformer version 3.18.x or later supports assuming another role when the pipeline meets the stage library and cluster type requirements. Data Collector version 3.19.x supports assuming another role only with instance profile authentication. |
Role ARN | Amazon resource name (ARN) of the role to assume, entered in the following format: arn:aws:iam::<account_id>:role/<role_name> Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. Available when assuming another role. |
Role Session Name | Optional name for the session created by assuming a role. Overrides the default unique identifier for the session. Available when assuming another role. |
Session Timeout | Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time. Set to a value between 3,600 seconds and 43,200 seconds. Available when assuming another role. |
Set Session Tags | Sets a session tag to record the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts. When cleared, the connection does not set a session tag. Available when assuming another role. |
Use Specific Region | Specify the AWS region or endpoint to connect to. When cleared, the connection uses the Amazon S3 default global endpoint, s3.amazonaws.com. Available only for Amazon S3 connections. |
Region | AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other. |
Endpoint | Endpoint to connect to when you select Other for the region. Enter the endpoint name. |
Use Custom Endpoint | Specify a specific signing region when connecting to a custom endpoint. When cleared, the connection uses the region specified in the endpoint. Available only for Amazon S3 connections, when using Data Collector 4.4.0 or later. |
Signing Region | AWS region used by the custom endpoint. |
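As a rough analogy rather than a statement about the connection's internals, the region and endpoint properties play the same role as the region_name and endpoint_url arguments when building a client with boto3; the values below are placeholders:
```python
import boto3

# Selecting a region corresponds to setting region_name; selecting
# Other and entering an endpoint corresponds to endpoint_url.
s3 = boto3.client(
    "s3",
    region_name="us-west-2",
    endpoint_url="https://s3.us-west-2.amazonaws.com",
)
```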
Advanced Properties
- Amazon Kinesis Firehose
- Amazon Kinesis Streams
- Amazon S3
- Amazon SQS
Advanced Property | Description |
---|---|
Connection Timeout | Seconds to wait for a response before closing the connection. |
Socket Timeout | Seconds to wait for a response to a query. |
Retry Count | Maximum number of times to retry requests. |
Use Proxy | Specifies whether to use a proxy to connect. |
Proxy Host | Proxy host. |
Proxy Port | Proxy port. |
Proxy User | User name for proxy credentials. |
Proxy Password | Password for proxy credentials. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
Proxy Domain | Optional domain name for the proxy server. |
Proxy Workstation | Optional workstation for the proxy server. |
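For orientation, several of these advanced properties have rough boto3/botocore counterparts. The sketch below is illustrative only; the timeout, retry, and proxy values, and the proxy URL, are placeholders, and the engine is not implemented this way:
```python
import boto3
from botocore.config import Config

config = Config(
    connect_timeout=10,            # analogous to Connection Timeout (seconds)
    read_timeout=60,               # analogous to Socket Timeout (seconds)
    retries={"max_attempts": 3},   # analogous to Retry Count
    proxies={"https": "http://proxy-user:proxy-pass@proxy.example.com:8080"},  # proxy settings
)
s3 = boto3.client("s3", config=config)
```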
Amazon Redshift Properties
Redshift Property | Description |
---|---|
Redshift Endpoint | Amazon Redshift endpoint to use. |
Credential Property | Description |
---|---|
Security | Authentication method used to connect to Amazon Web Services (AWS): instance profile or AWS access keys. |
Access Key ID | AWS access key ID. Required when using AWS keys to authenticate with AWS. |
Secret Access Key | AWS secret access key. Required when using AWS keys to authenticate with AWS. Tip: To secure sensitive information, you can use credential stores or runtime resources. |
DB User | Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table. |
DB Password | Password for the database user account. Available when using Instance Profile security. |
IAM Role for Unload to S3 | ARN of the IAM role assigned to the Redshift cluster. Transformer uses the role to write to the specified S3 staging location. The role must have write permission for the S3 staging location. Available when using Instance Profile security. |
Auto-Create DB User | Enables creating a database user to write data to Redshift. Available when using AWS Keys security. |
DB Groups | Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the S3 staging location. Available when using AWS Keys security. |
EMR Cluster Manager Properties
EMR Property | Description |
---|---|
S3 Staging URI | Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline. Use the following format: s3://<bucket>/<path> |
Provision a New Cluster | Provisions a new cluster to run the pipeline. Tip: Provisioning a cluster that terminates after the pipeline stops is a cost-effective method of running a Transformer pipeline. For more information about running a pipeline on a provisioned cluster, see Provisioned Cluster. |
Cluster by Name and Tags | Enables specifying a cluster by cluster name and tags instead of by cluster ID. Available when not provisioning a cluster and when using an authoring Data Collector version 5.4.0 or later. For more information, see Specifying a Cluster. |
Cluster Name | Name of the existing cluster to run the pipeline. This property is case-sensitive. Available when specifying a cluster by name and tags. |
Cluster Tags | Tag name and values to use to differentiate between multiple clusters with the specified cluster name. Click Add to define a tag. Click Add Another to define additional tags. Available when specifying a cluster by name and tags. |
Cluster ID | ID of the existing cluster to run the pipeline. For more information, see Specifying a Cluster. |
EMR Version | EMR cluster version to provision. Transformer supports version 5.20.0 or later 5.x versions. Available only for provisioned clusters. |
Cluster Name Prefix | Prefix for the name of the provisioned EMR cluster. Available only for provisioned clusters. |
Terminate Cluster | Terminates the provisioned cluster when the pipeline stops. When cleared, the cluster remains active after the pipeline stops. Available only for provisioned clusters. |
Logging Enabled | Enables copying log data to a specified Amazon S3 location. Use to preserve log data that would otherwise become unavailable when the provisioned cluster terminates. Available only for provisioned clusters. |
S3 Log URI | Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: s3://<bucket>/<path> The bucket must exist before you start the pipeline. Available when you enable logging for a provisioned cluster. |
Service Role | EMR role used by the Transformer EC2 instance to provision resources and perform other service-level tasks. Default is EMR_DefaultRole. Available only for provisioned clusters. |
Job Flow Role | EMR role for the EC2 instances within the cluster used to perform pipeline tasks. Default is EMR_EC2_DefaultRole. Available only for provisioned clusters. |
SSH EC2 Key ID | SSH key used to access the EMR cluster nodes. Transformer does not use or require an SSH key to access the nodes. Enter an SSH key ID if you plan to connect to the nodes using SSH for monitoring or troubleshooting purposes. For more information about using SSH keys to access EMR cluster nodes, see the Amazon EMR documentation. Available only for provisioned clusters. |
Visible to All Users | Enables all AWS Identity and Access Management (IAM) users under your account to access the provisioned cluster. Available only for provisioned clusters. |
EC2 Subnet ID | EC2 subnet identifier to launch the provisioned cluster in. Available only for provisioned clusters. |
Master Security Group | ID of the security group on the master node in the cluster. Note: Verify that the master security group allows Transformer to access the master node in the EMR cluster. For information on configuring security groups for EMR clusters, see the Amazon EMR documentation. Available only for provisioned clusters. |
Slave Security Group | Security group ID for the slave nodes in the cluster. Available only for provisioned clusters. |
Instance Count | Number of EC2 instances to use. Each instance corresponds to a slave node in the EMR cluster. Minimum is 2. Using an additional instance for each partition can improve pipeline performance. Available only for provisioned clusters. |
Master Instance Type | EC2 instance type for the master node in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type. Available only for provisioned clusters. |
Master Instance Type (Custom) | Custom EC2 instance type for the master node. Available when you select Custom for the Master Instance Type property. |
Slave Instance Type | EC2 instance type for the slave nodes in the EMR cluster. If an instance type does not display in the list, select Custom and then enter the instance type. Available only for provisioned clusters. |
Slave Instance Type (Custom) | Custom EC2 instance type for the slave nodes. Available when you select Custom for the Slave Instance Type property. |
AWS Cluster Tags | AWS tags assigned to each EMR cluster provisioned. For each tag, specify a tag name or key and a tag value. Available only for provisioned clusters. |
Max Retries | Maximum number of times to retry a failed request or throttling error. |
Retry Base Delay | Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Throttling Retry Base Delay | Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Max Backoff | The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors. |
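To make the retry timing concrete, here is a small sketch of the doubling-with-cap behavior that the Retry Base Delay, Throttling Retry Base Delay, and Max Backoff properties describe; the function name and the sample values are illustrative, not part of the product. The same behavior applies to the equivalent EMR Serverless properties below.
```python
def retry_delay_ms(attempt: int, base_delay_ms: int, max_backoff_ms: int) -> int:
    """Delay before retry number `attempt` (0-based): the base delay is
    doubled for each subsequent retry, capped at the maximum backoff."""
    return min(base_delay_ms * (2 ** attempt), max_backoff_ms)

# With a base delay of 100 ms and a max backoff of 1,000 ms, the delays
# are 100, 200, 400, 800, 1000, 1000, ... milliseconds.
delays = [retry_delay_ms(n, 100, 1000) for n in range(6)]
```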
EMR Serverless Properties
EMR Serverless Property | Description |
---|---|
S3 Staging URI | Amazon S3 bucket and path used to store the Transformer resources and files needed to run the pipeline. The specified bucket and path must exist before you run the pipeline. Use the following format: s3://<bucket>/<path> |
Create a New Application | Creates a new EMR Serverless application to run the pipeline. |
Application by Name and Tags | Enables specifying an application by application name and tags instead of by application ID. Available when not creating a new application and when using an authoring Data Collector version 5.4.0 or later. For more information, see Specifying an Application. |
EMR Application Name | Name of the existing application to run the pipeline. This property is case-sensitive. Available when specifying an application by name and tags. |
EMR Application Tags | Tag name and values to use to differentiate between multiple applications with the specified application name. Click Add to define a tag. Click Add Another to define additional tags. Available when specifying an application by name and tags. |
Application ID | ID of an existing EMR Serverless application to run the pipeline. Available when not creating a new application and when not specifying an application by name and tags. For more information, see Specifying an Application. |
Runtime Role ARN | Identity and Access Management (IAM) role used by the job. The role must have access to the data sources, targets, scripts, and libraries that the job uses. Enter the role in the following format: arn:aws:iam::<account_id>:role/<role_name> Where <account_id> is the ID of your AWS account and <role_name> is the name of the role. |
EMR Version | Amazon EMR version of the created application. Transformer supports version 6.9.0 and later 6.x versions. Available when creating a new application. |
Application Name Prefix | Prefix automatically added to the EMR Serverless application name. A prefix can help you identify the created applications in the EMR Studio console. Available when creating a new application. |
Stop Application | Stops the EMR Serverless application when the pipeline stops. Available when creating a new application. |
Subnet IDs | IDs of the subnets that contain Transformer and the origin and destination systems configured in the pipeline. Specify subnets in the virtual private cloud (VPC) where the EMR Serverless application resides. Available when creating a new application. |
Security Group IDs | IDs of one or more security groups that can communicate with Transformer and the origin and destination systems configured in the pipeline. Available when creating a new application. |
Maximum CPU (vCPU) | Maximum number of vCPUs that the application can scale to. Available when creating a new application. |
Maximum Memory (GB) | Maximum memory, specified in GB, that the application can scale to. Available when creating a new application. |
Maximum Disk (GB) | Maximum disk size, specified in GB, that the application can scale to. Available when creating a new application. |
Logging Enabled | Copies log files from job runs to a specified Amazon S3 location. Select to provide access to log data after the application stops. |
S3 Log URI | Location in Amazon S3 to store pipeline log data. Location must be unique for each pipeline. Use the following format: s3://<bucket>/<path> The bucket must exist before you start the pipeline. Available when Logging Enabled is selected. |
AWS Tags | AWS tags assigned to all applications and job runs created. For each tag, specify a tag name or key and a tag value. |
Max Retries | Maximum number of times to retry a failed request or throttling error. |
Retry Base Delay | Base delay in milliseconds for retrying after a failed request. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Throttling Retry Base Delay | Base delay in milliseconds for retrying after a throttling error. The specified number is doubled for each subsequent retry, up to the value specified for the Max Backoff property. |
Max Backoff | The maximum number of milliseconds to wait between retries. Limits the delay between retries after failed requests and throttling errors. |