Amazon S3

The Amazon S3 executor performs a task in Amazon S3 each time it receives an event. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Upon receiving an event, the executor can perform one of the following tasks:
  • Create a new Amazon S3 object for the specified content
  • Copy an object under 5 GB to another location in the same bucket and optionally delete the original object
  • Add tags to an existing object

Each Amazon S3 executor can perform one type of task. To perform additional tasks, use additional executors.

Use the Amazon S3 executor as part of an event stream. You can use the executor in any logical way, such as writing information from an event record to a new S3 object, or copying or tagging objects after they are written by the Amazon S3 destination.

When you configure the Amazon S3 executor, you specify the connection information, such as access keys, region, and bucket. You configure the expression that represents the object name and location. When creating new objects, you specify the content to place in the objects. When copying objects, you specify the location of the object and the location for the copy. You can also configure the executor to delete the original object after it is copied. When adding tags to an existing object, you specify the tags that you want to use.

You can configure the executor to use Amazon Web Services server-side encryption to protect the data written to Amazon S3. You can optionally use an HTTP proxy to connect to Amazon S3.

You can also use a connection to configure the executor.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Authentication Method

You can configure the Amazon S3 executor to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.

For more information about the authentication methods and details on how to configure each method, see Security in Amazon Stages.

Create New Objects

You can use the Amazon S3 executor to create new Amazon S3 objects and write the specified content to the object when the executor receives an event record.

When you create an object, you specify where to create the object and the content to write to the object. You can use an expression to represent both the location for the object and the content to use.

For example, say you want the executor to create a new Amazon S3 object for each object that the Amazon S3 destination writes, and to use the new object to store the record count information for each written object. Since the object-written event record includes the record count, you can enable the destination to generate events and route the event records to the Amazon S3 executor.

The object-written event record includes the bucket and object key of the written object. So, to create a new record-count object in the same bucket as the written object, you can use the following expression for the Object property:
${record:value('/bucket')}/${record:value('/objectKey')}.recordcount
The event record also includes the number of records written to the object. So, to write this information to the new object, you can use the following expression for the Content property:
${record:value('/recordCount')}
Tip: Stage-generated event records differ from stage to stage. For a description of stage events, see "Event Record" in the documentation for the event-generating stage. For a description of pipeline events, see Pipeline Event Records.
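
To see what the two expressions above evaluate to, the following Python sketch resolves them against a sample event record. It is an illustration only: the resolve function is a hypothetical stand-in that handles just the record:value() calls shown here, not the Data Collector expression language, and the field values in the sample record are made up.

```python
import re

# Matches ${record:value('/fieldPath')} and captures the field path.
_RECORD_VALUE = re.compile(r"\$\{record:value\('(/[^']*)'\)\}")


def resolve(expression: str, record: dict) -> str:
    """Replace each record:value() call with the matching top-level field."""
    def lookup(match):
        field = match.group(1).lstrip("/")
        return str(record[field])
    return _RECORD_VALUE.sub(lookup, expression)


# Hypothetical object-written event record; the values are invented.
event = {"bucket": "sales-data", "objectKey": "out/sdc-0001", "recordCount": 1500}

object_path = resolve(
    "${record:value('/bucket')}/${record:value('/objectKey')}.recordcount", event
)
content = resolve("${record:value('/recordCount')}", event)

print(object_path)  # sales-data/out/sdc-0001.recordcount
print(content)      # 1500
```

With this sample record, the new object sits next to the written object and contains only the record count.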

Copy Objects

You can use the Amazon S3 executor to copy an object to another location within the same bucket when the executor receives an event record. You can optionally delete the original object after the copy. The object must be under 5 GB in size.

When you copy an object, you specify the location of the object to be copied, and the location for the copy. The target location must be within the same bucket as the original object. You can use an expression to represent both locations. You can also specify whether to delete the original object.

A simple example is to move each written object to a Completed directory after it is closed. To do this, you configure the Amazon S3 destination to generate events. Since the object-written event record includes the bucket and object key, you can use that information to configure the Object property, as follows:
${record:value('/bucket')}/${record:value('/objectKey')}
Then, to move the object to a Completed directory, retaining the same object name, you can configure the New Object Path property, as follows:
${record:value('/bucket')}/completed/${record:value('/objectKey')}

You can then select Delete Original Object to remove the original object.
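
The copy-then-delete sequence can be sketched in Python as follows. This is not the executor's implementation. It only illustrates the order of operations, using the method names of the boto3 S3 client (head_object, copy_object, delete_object). The move_within_bucket function and the FakeS3 class are hypothetical, and the sketch assumes that 5 GB means 5 * 1024^3 bytes.

```python
# Assumption: the 5 GB limit is interpreted as 5 * 1024^3 bytes.
MAX_COPY_BYTES = 5 * 1024 ** 3


def move_within_bucket(s3, bucket, key, new_key, delete_original=True):
    """Copy key to new_key in the same bucket, then optionally delete key."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    if size >= MAX_COPY_BYTES:
        raise ValueError(f"{key} is {size} bytes; the object must be under 5 GB")
    s3.copy_object(
        Bucket=bucket, Key=new_key, CopySource={"Bucket": bucket, "Key": key}
    )
    if delete_original:
        # Runs only after the copy call has returned without raising.
        s3.delete_object(Bucket=bucket, Key=key)
    return new_key


class FakeS3:
    """In-memory stand-in that exposes the three calls used above."""

    def __init__(self, objects):
        self.objects = dict(objects)  # (bucket, key) -> bytes

    def head_object(self, Bucket, Key):
        return {"ContentLength": len(self.objects[(Bucket, Key)])}

    def copy_object(self, Bucket, Key, CopySource):
        source = (CopySource["Bucket"], CopySource["Key"])
        self.objects[(Bucket, Key)] = self.objects[source]

    def delete_object(self, Bucket, Key):
        del self.objects[(Bucket, Key)]


s3 = FakeS3({("sales-data", "out/sdc-0001"): b"rows"})
move_within_bucket(s3, "sales-data", "out/sdc-0001", "completed/out/sdc-0001")
remaining_keys = sorted(key for _bucket, key in s3.objects)
print(remaining_keys)  # ['completed/out/sdc-0001']
```

Passing delete_original=False leaves both objects in place, which corresponds to clearing Delete Original Object.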

To do something more complicated, such as moving only the subset of objects with a _west suffix to a different location, you can add a Stream Selector processor to the event stream. Configure it to route only the events where the /objectKey field includes a _west suffix to the Amazon S3 executor.
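
The routing decision for the _west example can be pictured as a simple predicate on the object key. The Python sketch below is not Stream Selector syntax. The has_west_suffix function is hypothetical, the sample keys are invented, and ignoring the file extension when checking the suffix is an assumption.

```python
import posixpath


def has_west_suffix(object_key: str) -> bool:
    """Return True when the object name, ignoring any extension, ends with _west."""
    name = posixpath.basename(object_key)
    stem, _extension = posixpath.splitext(name)
    return stem.endswith("_west")


# Hypothetical object-written events.
events = [
    {"bucket": "sales-data", "objectKey": "2024/orders_west.json"},
    {"bucket": "sales-data", "objectKey": "2024/orders_east.json"},
    {"bucket": "sales-data", "objectKey": "2024/returns_west"},
]

# Only these events would be routed on to the executor.
routed = [e["objectKey"] for e in events if has_west_suffix(e["objectKey"])]
print(routed)  # ['2024/orders_west.json', '2024/returns_west']
```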

Tag Existing Objects

You can use the Amazon S3 executor to add tags to existing Amazon S3 objects. Tags are key-value pairs that you can use to categorize objects, such as product: <product>.

You can configure multiple tags. When you configure a tag, you can define a tag with just the key or specify a key and value.

For more information about tags, including Amazon S3 restrictions, see the Amazon S3 documentation.
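
As an illustration of the key-value structure, the Python sketch below builds a tag set in the shape that Amazon S3 tagging requests use, a list of Key and Value entries, and checks the Amazon S3 limits of 10 tags per object, 128 characters per key, and 256 characters per value. The build_tag_set function is hypothetical, and representing a key-only tag with an empty value is an assumption about how such a tag is stored.

```python
MAX_TAGS = 10        # Amazon S3 allows up to 10 tags per object
MAX_KEY_LEN = 128    # tag key: up to 128 Unicode characters
MAX_VALUE_LEN = 256  # tag value: up to 256 Unicode characters


def build_tag_set(tags: dict) -> list:
    """Turn {key: value or None} into a list of {"Key": ..., "Value": ...} entries."""
    if len(tags) > MAX_TAGS:
        raise ValueError(f"{len(tags)} tags configured; the limit is {MAX_TAGS}")
    tag_set = []
    for key, value in tags.items():
        # Assumption: a tag defined with just a key gets an empty value.
        value = "" if value is None else value
        if not key or len(key) > MAX_KEY_LEN:
            raise ValueError(f"invalid tag key: {key!r}")
        if len(value) > MAX_VALUE_LEN:
            raise ValueError(f"value too long for tag {key!r}")
        tag_set.append({"Key": key, "Value": value})
    return tag_set


tag_set = build_tag_set({"product": "widgets", "reviewed": None})
print(tag_set)
```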

Event Generation

The Amazon S3 executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it creates a new object, adds tags to an existing object, or completes copying an object to a new location.

Amazon S3 events can be used in any logical way.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the Amazon S3 executor have the following event-related record header attributes. Record header attributes are stored as String values.
Record Header Attribute Description
sdc.event.type Event type. Uses the following event types:
  • file-changed - Generated when the executor adds tags to an existing object.
  • file-created - Generated when the executor creates a new object.
  • file-moved - Generated when the executor completes copying an object to a new location.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
The Amazon S3 executor can generate the following types of event records:
file-changed

The executor generates a file-changed event record when it adds tags to an existing object.

File-changed event records have the sdc.event.type record header attribute set to file-changed and include the following field:
Event Field Name Description
object_key Key of the tagged object.
file-created

The executor generates a file-created event record when it creates a new object.

File-created event records have the sdc.event.type record header attribute set to file-created and include the following field:
Event Field Name Description
object_key Key of the created object.
file-moved

The executor generates a file-moved event record when it completes copying an object to a new location.

File-moved event records have the sdc.event.type record header attribute set to file-moved and include the following field:
Event Field Name Description
object_key Key of the copied object.
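
A downstream consumer can branch on the sdc.event.type header attribute to tell the three event records apart. The Python sketch below models an event record as two plain dictionaries, which is a simplification, and the describe_event function and sample values are hypothetical. The header values are strings, as described above.

```python
# Maps each event type to the action the executor completed.
ACTIONS = {
    "file-changed": "tagged",
    "file-created": "created",
    "file-moved": "copied",
}


def describe_event(header: dict, fields: dict) -> str:
    """Summarize an executor event from its header attributes and fields."""
    event_type = header["sdc.event.type"]
    if event_type not in ACTIONS:
        raise ValueError(f"unexpected event type: {event_type}")
    return f"{ACTIONS[event_type]}: {fields['object_key']}"


# Hypothetical file-moved event record; the values are invented.
header = {
    "sdc.event.type": "file-moved",
    "sdc.event.version": "1",
    "sdc.event.creation_timestamp": "1700000000000",
}
fields = {"object_key": "completed/out/sdc-0001"}

summary = describe_event(header, fields)
print(summary)  # copied: completed/out/sdc-0001
```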

Server-Side Encryption

You can configure the stage to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the stage passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.

When you enable server-side encryption for the stage, you select one of the following ways that Amazon S3 manages the encryption keys:
Amazon S3-Managed Encryption Keys (SSE-S3)
When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
AWS KMS-Managed Encryption Keys (SSE-KMS)
When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. You can also specify key-value pairs to use for the encryption context.
Customer-Provided Encryption Keys (SSE-C)
When you use server-side encryption with customer-provided keys, you specify the following information:
  • Base64 encoded 256-bit encryption key
  • Base64 encoded 128-bit MD5 digest of the encryption key using RFC 1321

For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
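
For customer-provided keys, the two values can be derived from a raw 32-byte key as in the following Python sketch. The sse_c_values function is a hypothetical helper, and generating the key with os.urandom is only for illustration. In practice, you supply a key that you manage and store securely.

```python
import base64
import hashlib
import os


def sse_c_values(raw_key: bytes):
    """Return the Base64 key and the Base64 MD5 digest of the raw key bytes."""
    if len(raw_key) != 32:
        raise ValueError("SSE-C requires a 256-bit (32-byte) key")
    key_b64 = base64.b64encode(raw_key).decode("ascii")
    # The digest is computed over the raw key bytes, not over the Base64 text.
    md5_b64 = base64.b64encode(hashlib.md5(raw_key).digest()).decode("ascii")
    return key_b64, md5_b64


raw_key = os.urandom(32)  # illustration only
key_b64, key_md5_b64 = sse_c_values(raw_key)
print(len(key_b64), len(key_md5_b64))  # 44 24
```

The first value corresponds to the Customer Encryption Key property and the second to the Customer Encryption Key MD5 property.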

Configuring an Amazon S3 Executor

Configure an Amazon S3 executor to create new Amazon S3 objects, copy objects, or add tags to existing objects.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Amazon S3 tab, configure the following properties:
    Amazon S3 Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Data Collector EC2 instance.
    • None - Connects to a public bucket using no authentication.
    Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
    Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Assume Role Temporarily assumes another role to authenticate with AWS.
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    Use Specific Region Specify the AWS region or endpoint to connect to.

    When cleared, the stage uses the Amazon S3 default global endpoint, s3.amazonaws.com.

    Region AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
    Endpoint Endpoint to connect to when you select Other for the region. Enter the endpoint name.
    Use Custom Endpoint Specify the signing region to use when connecting to a custom endpoint.

    When cleared, the stage uses the region specified in the endpoint.

    Signing Region AWS region used by the custom endpoint.
    Bucket Bucket that contains the objects to be created, copied, or updated.
    Note: The bucket name must be DNS compliant. For more information about bucket naming conventions, see the Amazon S3 documentation.
    External ID External ID included in an IAM trust policy that allows the specified role to be assumed.

    Available when assuming another role.

  3. On the Tasks tab, configure the following properties:
    Task Property Description
    Task Task to perform upon receiving an event record. Select one of the following options:
    • Create New Object - Use to create a new S3 object with the configured content.
    • Copy Object - Use to copy a closed S3 object to another location in the same bucket.
    • Add Tags to Existing Object - Use to add tags to a closed S3 object.
    Object Path to the object to use. To use the object whose closure generated the event record, use the following expression:
    ${record:value('/objectKey')}
    To use a whole file whose closure generated the event record, use the following expression:
    ${record:value('/targetFileInfo/objectKey')}
    Content Content to write to new objects. You can use expressions to represent the content to use. For more information, see Create New Objects.
    New Object Path Path for the copied object. You can use expressions to represent the location and name of the object. For more information, see Copy Objects.
    Tags Tags to add to an existing object. Using simple or bulk edit mode, click Add Another to configure a tag.

    You can configure multiple tags. When you configure a tag, you can define a tag with just the key or specify a key and value.

  4. On the SSE tab, optionally enable server-side encryption:
    SSE Property Description
    Use Server-Side Encryption Enables server-side encryption.
    Server-Side Encryption Option Option that Amazon S3 uses to manage the encryption keys:
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.
    • SSE-C - Use customer-provided keys.

    Default is SSE-S3.

    AWS KMS Key ARN Amazon resource name (ARN) of the AWS KMS master encryption key. Use the following format:
    arn:aws:kms:<region>:<acct ID>:key/<key ID>

    Used for SSE-KMS encryption only.

    Encryption Context Key-value pairs to use for the encryption context. Click Add to add key-value pairs.

    Used for SSE-KMS encryption only.

    Customer Encryption Key Base64-encoded 256-bit encryption key to use.

    Used for SSE-C encryption only.

    Customer Encryption Key MD5 Base64-encoded 128-bit MD5 digest of the encryption key, computed according to RFC 1321.

    Used for SSE-C encryption only.

  5. To use an HTTP proxy, on the Advanced tab, configure the following properties:
    Advanced Property Description
    Use Proxy Specifies whether to use a proxy to connect.
    Proxy Host Proxy host.
    Proxy Port Proxy port.
    Proxy User User name for proxy credentials.
    Proxy Password Password for proxy credentials.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Proxy Domain Optional domain name for the proxy server.
    Proxy Workstation Optional workstation for the proxy server.