Amazon S3
The Amazon S3 destination writes objects to Amazon S3 in the specified data format, creating a separate object for every partition.
Before you run a pipeline that uses the Amazon S3 destination, make sure to complete the prerequisite tasks.
When you configure the Amazon S3 destination, you specify the authentication method to use. You can specify Amazon S3 server-side encryption for the data. You can also use a connection to configure the destination.
You specify the output location and write mode to use. You can configure the destination to group partition objects by field values. If you configure the destination to overwrite related partitions, you must configure Spark to overwrite partitions dynamically. You can also configure the destination to drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.
You select the data format to write and configure related properties.
You can also configure advanced properties such as performance-related properties and proxy server properties.
Prerequisites
- Verify permissions
- The user associated with the authentication credentials in effect must have WRITE permission on the S3 bucket.
- Perform prerequisite tasks for local pipelines
- To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.
URI Scheme
You can use the s3 or s3a URI scheme when you specify the bucket to write to. The URI scheme determines the underlying client that the destination uses to write to Amazon S3.
While both URI schemes are supported for EMR clusters, Amazon recommends using the s3 URI scheme with EMR clusters for better performance, security, and reliability. For all other clusters, use the s3a URI scheme.
For more information, see the Amazon documentation.
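For example, the same bucket and path is specified with a different scheme depending on the cluster type. The following Python sketch (a hypothetical helper for illustration, not part of Transformer) builds the URI per Amazon's recommendation:

```python
def s3_uri(bucket: str, path: str, emr_cluster: bool) -> str:
    """Build the bucket URI using the scheme Amazon recommends:
    s3 for EMR clusters, s3a for all other clusters.
    (Hypothetical helper for illustration only.)"""
    scheme = "s3" if emr_cluster else "s3a"
    return f"{scheme}://{bucket}/{path.strip('/')}/"

print(s3_uri("Sales", "orders/2024", emr_cluster=True))   # s3://Sales/orders/2024/
print(s3_uri("Sales", "orders/2024", emr_cluster=False))  # s3a://Sales/orders/2024/
```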
Authentication Method
You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Amazon Security.
Server-Side Encryption
You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the Base64 encoded 256-bit encryption key.
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
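For SSE-C, you supply the Base64 encoded 256-bit key yourself. A minimal Python sketch of generating such a key (illustrative only; generate and store any real key according to your security policies):

```python
import base64
import os

# Generate a random 256-bit (32-byte) key and Base64 encode it,
# as required for SSE-C (customer-provided keys).
raw_key = os.urandom(32)
encoded_key = base64.b64encode(raw_key).decode("ascii")

# The decoded key must be exactly 32 bytes (256 bits).
assert len(base64.b64decode(encoded_key)) == 32
print(encoded_key)  # 44 Base64 characters
```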
Write Mode
The write mode determines how the Amazon S3 destination writes objects to Amazon S3. When writing objects, the resulting names are based on the selected data format.
- Overwrite files
- Before writing any data from a batch, the destination removes all objects from the specified bucket and path.
- Overwrite related partitions
- Before writing data from a batch, the destination removes the objects from subfolders for which a batch has data. The destination leaves a subfolder intact if the batch has no data for that subfolder.
- Write new files to new directory
- When the pipeline starts, the destination writes objects in the specified bucket and path. Before writing data for each subsequent batch during the pipeline run, the destination removes any existing objects from the bucket and path. The destination generates an error if the specified bucket and path contain objects when you start the pipeline.
- Write new or append to existing files
- The destination creates new objects if they do not exist or appends data to an existing object if the object exists in the same bucket and path.
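These modes loosely parallel Spark DataFrameWriter save modes. The mapping below is an illustrative assumption for orientation, not a description of the destination's internal implementation:

```python
# Rough correspondence between the destination's write modes and Spark
# DataFrameWriter save modes -- an illustrative assumption, not the
# product's actual implementation.
WRITE_MODE_TO_SPARK_SAVE_MODE = {
    "Overwrite files": "overwrite",
    "Overwrite related partitions": "overwrite",  # with dynamic partitionOverwriteMode
    "Write new files to new directory": "errorifexists",
    "Write new or append to existing files": "append",
}
```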
Spark Requirement to Overwrite Related Partitions
If you set Write Mode to Overwrite Related Partitions, you must configure Spark to overwrite partitions dynamically.
Set the Spark configuration property spark.sql.sources.partitionOverwriteMode to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. For example, you might set the property to dynamic in Spark when you plan to enable the Overwrite Related Partitions mode in most of your pipelines, and then set the property to static in individual pipelines that do not use that mode.
To configure the property in an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
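The extra configuration added on the Cluster tab amounts to a single property/value pair, shown here as a plain mapping for illustration:

```python
# Extra Spark configuration to add on the pipeline's Cluster tab
# when using Overwrite Related Partitions (shown as a plain mapping
# for illustration).
extra_spark_config = {
    "spark.sql.sources.partitionOverwriteMode": "dynamic",
}

# A pipeline that does not use that mode could instead set the
# property back to the Spark default:
static_override = {
    "spark.sql.sources.partitionOverwriteMode": "static",
}
```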
Partition Objects
Pipelines process data in partitions. The Amazon S3 destination writes objects, which contain the processed data, in the configured bucket and path. The destination writes one object for each partition.
The destination groups the partition objects if you specify one or more fields in the Partition by Fields property. For each unique value of the fields specified in the Partition by Fields property, the destination creates a folder. In each folder, the destination writes one object for each partition that has the corresponding field and value. Because the folder name includes the field name and value, the object omits that data. With grouped partition objects, you can more easily find data with certain values, such as all the data for a particular city.
To overwrite folders that have updated data and leave other folders intact, set the Write Mode property to Overwrite Related Partitions. Then, the destination clears affected folders before writing, replacing objects in those folders with new objects.
If the Partition by Fields property lists no fields, the destination does not group partition objects and writes one object for each partition directly in the configured bucket and path.
Because the text data format only contains data from one field, the destination does not group partition objects for the text data format. Do not configure the Partition by Fields property if the destination writes in the text data format.
Example: Grouping Partition Objects
Suppose your pipeline processes orders. You want to write data in the Sales bucket and group the data by cities. Therefore, in the Bucket and Path property, you enter Sales, and in the Partition by Fields property, you enter City. Your pipeline is a batch pipeline that only processes one batch. You want to overwrite the entire Sales bucket each time you run the pipeline. Therefore, in the Write Mode property, you select Overwrite Files.
As the pipeline processes the batch, the origins and processors lead Spark to split the data into three partitions. The batch contains two values in the City field: Oakland and Fremont. Before writing the processed data in the Sales bucket, the destination removes any existing objects and folders and then creates two folders, City=Oakland and City=Fremont. In each folder, the destination writes one object for each partition that contains data for that city.
Note that the written objects do not include the City field; instead, you infer the city from the folder name. The first partition does not include any data for Fremont; therefore, the City=Fremont folder does not contain an object from the first partition. Similarly, the third partition does not include any data for Oakland; therefore, the City=Oakland folder does not contain an object from the third partition.
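The grouping behavior described above can be sketched in Python (a simplified simulation for illustration, not the destination's actual writer):

```python
from collections import defaultdict

# Simulate how the destination groups partition objects by the City field.
# Each inner list is one Spark partition of the processed batch.
partitions = [
    [{"City": "Oakland", "order": 1}, {"City": "Oakland", "order": 2}],
    [{"City": "Oakland", "order": 3}, {"City": "Fremont", "order": 4}],
    [{"City": "Fremont", "order": 5}],
]

layout = defaultdict(list)  # folder name -> object names
for i, part in enumerate(partitions):
    cities = {rec["City"] for rec in part}
    for city in cities:
        # One object per partition that has data for this city; the City
        # field itself is omitted because the folder name carries it.
        layout[f"City={city}"].append(f"part-{i}")

print(dict(layout))
```

As in the example, partition 0 contributes no object to City=Fremont and partition 2 contributes no object to City=Oakland.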
Data Formats
The Amazon S3 destination writes records based on the specified data format.
- Avro
- The destination writes an object for each partition and includes the Avro schema in each object.
- Delimited
- The destination writes an object for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes an object for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an object for each partition.
- Parquet
- The destination writes an object for each partition and includes the Parquet schema in every object.
- Text
- The destination writes an object for every partition and uses \n as the newline character.
- XML
- The destination writes an object for every partition. You specify the root and row tags to use in output files.
Configuring an Amazon S3 Destination
- In the Properties panel, on the General tab, configure the following properties:
- Name - Stage name.
- Description - Optional description.
- Stage Library - Stage library to use to connect to Amazon S3:
- AWS cluster-provided libraries - The cluster where the pipeline runs has Apache Hadoop Amazon Web Services libraries installed, and therefore has all of the necessary libraries to run the pipeline.
- AWS Transformer-provided libraries for Hadoop 2.7.7 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline. Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 2.7.7.
- AWS Transformer-provided libraries for Hadoop 3.2.0 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline. Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.2.0.
- AWS Transformer-provided libraries for Hadoop 3.3.4 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline. Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.3.4.
Note: When using additional Amazon stages in the pipeline, ensure that they use the same stage library.
- On the Amazon S3 tab, configure the following properties:
- Connection - Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
- Authentication Method - Authentication method used to connect to Amazon Web Services (AWS):
- AWS Keys - Authenticates using an AWS access key pair.
- Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
- None - Connects to a public bucket using no authentication.
- Access Key ID - AWS access key ID. Required when using AWS keys to authenticate with AWS.
- Secret Access Key - AWS secret access key. Required when using AWS keys to authenticate with AWS.
- Assume Role - Temporarily assumes another role to authenticate with AWS. Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
- Role Session Name - Optional name for the session created by assuming a role. Overrides the default unique identifier for the session. Available when assuming another role.
- Role ARN - Amazon resource name (ARN) of the role to assume, entered in the following format: arn:aws:iam::<account_id>:role/<role_name>, where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed. Available when assuming another role.
- Session Timeout - Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time. Set to a value between 3,600 seconds and 43,200 seconds. Available when assuming another role.
- Set Session Tags - Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role. Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts. When cleared, the connection does not set a session tag. Available when assuming another role.
- External ID - External ID included in an IAM trust policy that allows the specified role to be assumed. Available when assuming another role.
- Use Specific Region - Specify the AWS region or endpoint to connect to. When cleared, the stage uses the Amazon S3 default global endpoint, s3.amazonaws.com.
- Region - AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
- Endpoint - Endpoint to connect to when you select Other for the region. Enter the endpoint name.
- Bucket and Path - Location to write objects. Include an Amazon S3 bucket name with an optional folder path. Specify the location with the s3 or s3a URI scheme, as follows:
- For EMR clusters: s3://<bucket name>/<path to objects>/
- For all other clusters: s3a://<bucket name>/<path to objects>/
The instance profile or AWS access key pair used to authenticate with AWS must have write access to the bucket. The specified bucket must exist before you run the pipeline.
- Write Mode - Mode to write objects:
- Overwrite files - Removes all objects from the specified bucket and path.
- Overwrite related partitions - Removes the objects from subfolders for which a batch has data.
- Write new or append to existing files - Creates new objects or appends data to an existing object.
- Write new files to new directory - Writes objects in a new bucket and path.
- Server-Side Encryption Option - Option that Amazon S3 uses to manage encryption keys for server-side encryption:
- None - Do not use server-side encryption.
- SSE-S3 - Use Amazon S3-managed keys.
- SSE-KMS - Use Amazon Web Services KMS-managed keys.
- SSE-C - Use customer-provided keys.
Default is SSE-S3.
- AWS KMS Key ARN - Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use, in the following format: arn:<partition>:kms:<region>:<account-id>:key/<key-id>
For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
Used for SSE-KMS encryption only.
- Customer Encryption Key - The Base64 encoded 256-bit encryption key to use. Used for SSE-C encryption only.
- Partition by Fields - Fields used to group partition objects. The destination creates a subfolder for each unique field value and writes an object in the subfolder for each processed partition that contains the corresponding field and value. In a file dimension pipeline that writes to a grouped file dimension, specify the fields that are listed in the Key Fields property of the Slowly Changing Dimension processor.
- Exclude Unrelated SCD Master Records - Does not write slowly changing dimension master records that are unrelated to the change data. Writes only the master records related to the change data. Use only in a slowly changing dimension pipeline that writes to a grouped file dimension. In pipelines that write to an ungrouped file dimension, clear this option to include all master records.
For guidelines on configuring slowly changing dimension pipelines, see Pipeline Configuration.
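The Role ARN format described in the table above can also be built programmatically. The helper and role name below are hypothetical, for illustration only:

```python
def role_arn(account_id: str, role_name: str) -> str:
    """Build a role ARN in the format the Role ARN property expects:
    arn:aws:iam::<account_id>:role/<role_name>
    (Hypothetical helper; the role name below is made up.)"""
    return f"arn:aws:iam::{account_id}:role/{role_name}"

print(role_arn("111122223333", "transformer-s3-writer"))
# arn:aws:iam::111122223333:role/transformer-s3-writer
```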
- On the Advanced tab, optionally configure the following properties:
- Additional Configuration - Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files. To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.
- Max Threads - Maximum number of concurrent threads to use for parallel uploads.
- Buffer Hint - TCP socket buffer size hint, in bytes. Default is 8192.
- Maximum Connections - Maximum number of connections to Amazon. Default is 1.
- Connection Timeout - Seconds to wait for a response before closing the connection.
- Socket Timeout - Seconds to wait for a response to a query.
- Retry Count - Maximum number of times to retry requests.
- Use Proxy - Specifies whether to use a proxy to connect.
- Proxy Host - Proxy host.
- Proxy Port - Proxy port.
- Proxy User - User name for proxy credentials.
- Proxy Password - Password for proxy credentials.
- Proxy Domain - Optional domain name for the proxy server.
- Proxy Workstation - Optional workstation for the proxy server.
- On the Data Format tab, configure the following property:
- Data Formats - Format of the data. Select one of the following formats:
- Avro (Spark 2.4 or later) - For Avro data processed by Spark 2.4 or later.
- Avro (Spark 2.3) - For Avro data processed by Spark 2.3.
- Delimited
- JSON
- ORC
- Parquet
- Text
- XML
- For delimited data, on the Data Format tab, configure the following property:
- Delimiter Character - Delimiter character to use in the data. Select one of the available options or select Other to enter a custom character. You can enter a Unicode control character using the format \uNNNN, where N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.
- Quote Character - Quote character to use in the data.
- Escape Character - Escape character to use in the data.
- For text data, on the Data Format tab, configure the following property:
- Text Field - String field in the record that contains the data to be written. All data must be incorporated into the specified field.
- For XML data, on the Data Format tab, configure the following properties:
- Root Tag - Tag to use as the root element. Default is ROWS, which results in a <ROWS> root element.
- Row Tag - Tag to use as a record delineator. Default is ROW, which results in a <ROW> record delineator element.
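With the default ROWS and ROW tags, the output shape resembles the following, sketched here with Python's standard library (illustrative only; this is not the destination's actual XML writer):

```python
import xml.etree.ElementTree as ET

# Build a minimal document with the default ROWS root tag and ROW
# record tag; each record field becomes a child element of its ROW.
root = ET.Element("ROWS")
for record in [{"City": "Oakland"}, {"City": "Fremont"}]:
    row = ET.SubElement(root, "ROW")
    for field, value in record.items():
        ET.SubElement(row, field).text = value

print(ET.tostring(root, encoding="unicode"))
# <ROWS><ROW><City>Oakland</City></ROW><ROW><City>Fremont</City></ROW></ROWS>
```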