Amazon S3

The Amazon S3 destination writes objects to Amazon S3. The destination writes data based on the specified data format and creates a separate object for every partition.

Before you run a pipeline that uses the Amazon S3 destination, make sure to complete the prerequisite tasks.

When you configure the Amazon S3 destination, you specify the authentication method to use. You can specify Amazon S3 server-side encryption for the data. You can also use a connection to configure the destination.

You specify the output location and write mode to use. You can configure the destination to group partition objects by field values. If you configure the destination to overwrite related partitions, you must configure Spark to overwrite partitions dynamically. You can also configure the destination to drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.

You select the data format to write and configure related properties.

You can also configure advanced properties such as performance-related properties and proxy server properties.

Prerequisites

Before writing to Amazon S3 with the Amazon S3 destination, complete the following prerequisite tasks:
Verify permissions
The user associated with the authentication credentials in effect must have WRITE permission on the S3 bucket.
Perform prerequisite tasks for local pipelines

To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.
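
If you want to verify write permission before running the pipeline, you can upload and delete a small test object with the AWS SDK. The following is a minimal boto3 sketch, separate from Transformer, that uses a hypothetical bucket name and the default AWS credential chain:

  import boto3

  # Hypothetical bucket name; replace with the bucket the destination writes to.
  BUCKET = "my-transformer-output"

  # Uses the default AWS credential chain (access keys, instance profile, and so on).
  s3 = boto3.client("s3")

  # Upload and then remove a tiny marker object to confirm write permission on the bucket.
  s3.put_object(Bucket=BUCKET, Key="_transformer_write_check", Body=b"")
  s3.delete_object(Bucket=BUCKET, Key="_transformer_write_check")
  print("Write permission confirmed for bucket:", BUCKET)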

URI Scheme

You can use the s3 or s3a URI scheme when you specify the bucket to write to. The URI scheme determines the underlying client that the destination uses to write to Amazon S3.

Both URI schemes are supported for EMR clusters, but Amazon recommends the s3 URI scheme for better performance, security, and reliability. For all other clusters, use the s3a URI scheme.
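
For example, with a hypothetical bucket named sales-data and a folder named orders, you might specify one of the following locations:

  s3://sales-data/orders/     (EMR clusters)
  s3a://sales-data/orders/    (all other clusters)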

For more information, see the Amazon documentation.

Authentication Method

You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
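
Transformer handles authentication for you based on the selected method. Purely as an illustration of what each method corresponds to, the following boto3 sketch, which is separate from Transformer, creates clients outside the pipeline using placeholder values:

  import boto3
  from botocore import UNSIGNED
  from botocore.config import Config

  # AWS access keys (placeholder values).
  keys_client = boto3.client(
      "s3",
      aws_access_key_id="<access key ID>",
      aws_secret_access_key="<secret access key>",
  )

  # Instance profile: the default credential chain picks up the EC2 instance
  # profile automatically when no keys are supplied.
  profile_client = boto3.client("s3")

  # None: anonymous, unsigned requests to a public bucket.
  anonymous_client = boto3.client("s3", config=Config(signature_version=UNSIGNED))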

For more information about the authentication methods and details on how to configure each method, see Amazon Security.

Server-Side Encryption

You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.

When you enable server-side encryption for the destination, you select one of the following ways that Amazon S3 manages the encryption keys:
Amazon S3-Managed Encryption Keys (SSE-S3)
When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
AWS KMS-Managed Encryption Keys (SSE-KMS)
When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
Customer-Provided Encryption Keys (SSE-C)
When you use server-side encryption with customer-provided keys, you specify the Base64 encoded 256-bit encryption key.
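
After a pipeline run, you can confirm that objects were encrypted as expected by inspecting object metadata with the AWS SDK. The following is a minimal boto3 sketch with hypothetical bucket and object names:

  import boto3

  s3 = boto3.client("s3")

  # Hypothetical bucket and object key written by the destination.
  response = s3.head_object(Bucket="my-transformer-output", Key="orders/part-00000.snappy.parquet")

  # AES256 indicates SSE-S3; aws:kms indicates SSE-KMS. Objects written with SSE-C
  # require the customer-provided key in the request and are not covered here.
  print(response.get("ServerSideEncryption"))
  print(response.get("SSEKMSKeyId"))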

For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.

Write Mode

The write mode determines how the Amazon S3 destination writes objects to Amazon S3. The names of the written objects are based on the selected data format.

The Amazon S3 destination includes the following write modes:
Overwrite files
Before writing any data from a batch, the destination removes all objects from the specified bucket and path.
Overwrite related partitions
Before writing data from a batch, the destination removes the objects from subfolders for which a batch has data. The destination leaves a subfolder intact if the batch has no data for that subfolder.
For example, say you have a bucket with ten folders. If the processed data belongs in two folders, the destination overwrites the two folders with the new data. The other eight folders remain unchanged.
Use this mode to overwrite select objects, such as when writing to a grouped file dimension in a slowly changing dimension pipeline.

The Partition by Fields property specifies the fields used to create subfolders. With no specified fields, the destination creates no subfolders and this mode works the same as the Overwrite Files mode.

Note: This mode requires you to configure Spark to overwrite partitions dynamically.
Write new files to new directory
When the pipeline starts, the destination writes objects in the specified bucket and path. Before writing data for each subsequent batch during the pipeline run, the destination removes any objects from the bucket and path. The destination generates an error if the specified bucket and path contains objects when you start the pipeline.
Write new or append to existing files
The destination creates new objects if they do not exist, or appends data to existing objects in the specified bucket and path.

Spark Requirement to Overwrite Related Partitions

If you set Write Mode to Overwrite Related Partitions, you must configure Spark to overwrite partitions dynamically.

Set the Spark configuration property spark.sql.sources.partitionOverwriteMode to dynamic.

You can configure the property in Spark, or you can configure the property in individual pipelines. For example, you might set the property to dynamic in Spark when you plan to enable the Overwrite Related Partitions mode in most of your pipelines, and then set the property to static in individual pipelines that do not use that mode.

To configure the property in an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
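
For reference, the following PySpark sketch shows how the property can be set directly in Spark, either when the session is created or on a running session:

  from pyspark.sql import SparkSession

  # Set the property when the Spark session is created so that overwrites
  # replace only the partitions present in the new data.
  spark = (
      SparkSession.builder
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()
  )

  # The property can also be checked or changed on a running session.
  spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  print(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))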

Partition Objects

Pipelines process data in partitions. The Amazon S3 destination writes objects, which contain the processed data, in the configured bucket and path. The destination writes one object for each partition.

The destination groups the partition objects if you specify one or more fields in the Partition by Fields property. For each unique value of the fields specified in the Partition by Fields property, the destination creates a folder. In each folder, the destination writes one object for each partition that has the corresponding field and value. Because the folder name includes the field name and value, the object omits that data. With grouped partition objects, you can more easily find data with certain values, such as all the data for a particular city.

To overwrite folders that have updated data and leave other folders intact, set the Write Mode property to Overwrite Related Partitions. Then, the destination clears affected folders before writing, replacing objects in those folders with new objects.

If the Partition by Fields property lists no fields, the destination does not group partition objects and writes one object for each partition directly in the configured bucket and path.

Because the text data format only contains data from one field, the destination does not group partition objects for the text data format. Do not configure the Partition by Fields property if the destination writes in the text data format.

Example: Grouping Partition Objects

Suppose your pipeline processes orders. You want to write data in the Sales bucket and group the data by cities. Therefore, in the Bucket and Path property, you enter Sales, and in the Partition by Fields property, you enter City. Your pipeline is a batch pipeline that only processes one batch. You want to overwrite the entire Sales bucket each time you run the pipeline. Therefore, in the Write Mode property, you select Overwrite Files.

As the pipeline processes the batch, the configured origins and processors result in Spark splitting the data into three partitions. The batch contains two values in the City field: Oakland and Fremont. Before writing the processed data in the Sales bucket, the destination removes any existing objects and folders and then creates two folders, City=Oakland and City=Fremont. In each folder, the destination writes one object for each partition that contains data for that city.

Note that the written objects do not include the City field; instead, you infer the city from the folder name. The first partition does not include any data for Fremont; therefore, the City=Fremont folder does not contain an object from the first partition. Similarly, the third partition does not include any data for Oakland; therefore, the City=Oakland folder does not contain an object from the third partition.
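
The grouped layout matches Spark's standard partitioned output. The following PySpark sketch, which uses hypothetical data and runs outside of Transformer, produces the same City=Oakland and City=Fremont folder structure and shows that Spark restores the City column from the folder names when the output is read back:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  orders = spark.createDataFrame(
      [(1, "Oakland", 25.0), (2, "Fremont", 40.0), (3, "Oakland", 15.0)],
      ["OrderId", "City", "Total"],
  )

  # Writing with partitionBy creates one subfolder per City value, for example:
  #   s3a://Sales/City=Oakland/part-<multipart partition number>.snappy.parquet
  #   s3a://Sales/City=Fremont/part-<multipart partition number>.snappy.parquet
  # The City column itself is omitted from the written objects.
  orders.write.mode("overwrite").partitionBy("City").parquet("s3a://Sales/")

  # On read, Spark restores the City column from the folder names.
  spark.read.parquet("s3a://Sales/").show()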

Data Formats

The Amazon S3 destination writes records based on the specified data format.

The destination can write using the following data formats:
Avro
The destination writes an object for each partition and includes the Avro schema in each object.
Objects use the following naming convention:
part-<multipart partition number>.avro
When you configure the destination, you must specify the Avro option appropriate for the version of Spark that runs the pipeline: Spark 2.3, or Spark 2.4 or later.
Delimited
The destination writes an object for each partition. It creates a header line for each object and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
Objects use the following naming convention:
part-<multipart partition number>.csv
JSON
The destination writes an object for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
Objects use the following naming convention:
part-<multipart partition number>.json
ORC
The destination writes an object for each partition.
Output files use the following naming convention:
part-<multipart partition number>.snappy.orc
Parquet
The destination writes an object for each partition and includes the Parquet schema in every object.
Objects use the following naming convention:
part-<multipart partition number>.snappy.parquet
Text
The destination writes an object for every partition and uses \n as the newline character.
The destination writes data from a single String field. You can specify the field in the record to use.
Objects use the following naming convention:
part-<multipart partition number>.txt
Do not configure the Partition by Fields property when the destination writes Text data.
XML
The destination writes an object for every partition. You specify the root and row tags to use in output files.
Output files use the following naming convention:
part-<5 digit number> 

Configuring an Amazon S3 Destination

Configure an Amazon S3 destination to write Amazon S3 objects. Before you run the pipeline, make sure to complete the prerequisite tasks.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use to connect to Amazon S3:
    • AWS cluster-provided libraries - The cluster where the pipeline runs has Apache Hadoop Amazon Web Services libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • AWS Transformer-provided libraries for Hadoop 2.7.7 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 2.7.7.

    • AWS Transformer-provided libraries for Hadoop 3.2.0 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.2.0.

    • AWS Transformer-provided libraries for Hadoop 3.3.4 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.3.4.

    Note: When using additional Amazon stages in the pipeline, ensure that they use the same stage library.
  2. On the Amazon S3 tab, configure the following properties:
    Amazon S3 Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
    • None - Connects to a public bucket using no authentication.
    Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
    Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Assume Role Temporarily assumes another role to authenticate with AWS.
    Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged-in StreamSets user who starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    External ID External ID included in an IAM trust policy that allows the specified role to be assumed.

    Available when assuming another role.

    Use Specific Region Specify the AWS region or endpoint to connect to.

    When cleared, the stage uses the Amazon S3 default global endpoint, s3.amazonaws.com.

    Region AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
    Endpoint Endpoint to connect to when you select Other for the region. Enter the endpoint name.
    Bucket and Path Location to write objects. Include an Amazon S3 bucket name with an optional folder path.
    Specify the location with the s3 or s3a URI scheme, as follows:
    • For EMR clusters:
      s3://<bucket name>/<path to objects>/
    • For all other clusters:
      s3a://<bucket name>/<path to objects>/

    The instance profile or AWS access key pair used to authenticate with AWS must have write access to the bucket.

    The specified bucket must exist before you run the pipeline.

    Write Mode Mode to write objects:
    • Overwrite files - Removes all objects from the specified bucket and path.
    • Overwrite related partitions - Removes the objects from subfolders for which a batch has data.
    • Write new or append to existing files - Creates new objects or appends data to an existing object.
    • Write new files to new directory - Writes objects to the specified bucket and path, which must not contain objects when you start the pipeline.
    Server-Side Encryption Option Option that Amazon S3 uses to manage encryption keys for server-side encryption:
    • None - Do not use server-side encryption.
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.
    • SSE-C - Use customer-provided keys.

    Default is SSE-S3.

    AWS KMS Key ARN Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format:
    arn:<partition>:kms:<region>:<account-id>:key/<key-id>

    For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-KMS encryption only.

    Customer Encryption Key The 256-bit Base64 encoded encryption key to use.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-C encryption only.

    Partition by Fields Fields used to group partition objects. The destination creates a subfolder for each unique field value and writes an object in the subfolder for each processed partition that contains the corresponding field and value.

    In a slowly changing dimension pipeline that writes to a grouped file dimension, specify the fields that are listed in the Key Fields property of the Slowly Changing Dimension processor.

    Exclude Unrelated SCD Master Records Does not write slowly changing dimension master records that are unrelated to the change data. Writes only the master records related to the change data.

    Selecting this property and setting Write Mode to Overwrite Related Partitions results in the destination overwriting only the objects with changes.

    Use only in a slowly changing dimension pipeline that writes to a grouped file dimension. In pipelines that write to an ungrouped file dimension, clear this option to include all master records.

    For guidelines on configuring slowly changing dimension pipelines, see Pipeline Configuration.

  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Additional Configuration

    Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.

    To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.

    Max Threads Maximum number of concurrent threads to use for parallel uploads.
    Buffer Hint

    TCP socket buffer size hint, in bytes.

    Default is 8192.

    Maximum Connections Maximum number of connections to Amazon.

    Default is 1.

    Connection Timeout Seconds to wait for a response before closing the connection.
    Socket Timeout Seconds to wait for a response to a query.
    Retry Count Maximum number of times to retry requests.
    Use Proxy Specifies whether to use a proxy to connect.
    Proxy Host Proxy host.
    Proxy Port Proxy port.
    Proxy User User name for proxy credentials.
    Proxy Password Password for proxy credentials.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Proxy Domain Optional domain name for the proxy server.
    Proxy Workstation Optional workstation for the proxy server.
  4. On the Data Format tab, configure the following property:
    Data Format Property Description
    Data Formats Format of the data. Select one of the following formats:
    • Avro (Spark 2.4 or later) - For Avro data processed by Spark 2.4 or later.
    • Avro (Spark 2.3) - For Avro data processed by Spark 2.3.
    • Delimited
    • JSON
    • ORC
    • Parquet
    • Text
    • XML
  5. For delimited data, on the Data Format tab, configure the following property:
    Delimited Property Description
    Delimiter Character Delimiter character to use in the data. Select one of the available options or select Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where N is a hexadecimal digit from 0-9 or A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Quote Character Quote character to use in the data.
    Escape Character Escape character to use in the data.
  6. For text data, on the Data Format tab, configure the following property:
    Text Property Description
    Text Field String field in the record that contains the data to be written. All data must be incorporated into the specified field.
  7. For XML data, on the Data Format tab, configure the following properties:
    XML Property Description
    Root Tag Tag to use as the root element.

    Default is ROWS, which results in a <ROWS> root element.

    Row Tag Tag to use as a record delineator.

    Default is ROW, which results in a <ROW> record delineator element.
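
    For example, with the default root and row tags, each written object has the following general shape. The record fields shown here are hypothetical:

      <ROWS>
          <ROW>
              <OrderId>1</OrderId>
              <City>Oakland</City>
          </ROW>
          <ROW>
              <OrderId>2</OrderId>
              <City>Fremont</City>
          </ROW>
      </ROWS>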