Amazon Redshift
The Amazon Redshift destination writes data to an Amazon Redshift table. Use the destination in Databricks, Dataproc, or EMR pipelines.
The Amazon Redshift destination stages data on Amazon S3 before writing it to Redshift. When you configure the destination, you specify the staging location on S3. You can optionally enable server-side encryption and have the destination delete objects from S3 after writing them to Redshift. You can also use a connection to configure the destination.
You define the Amazon Redshift endpoint, schema, and table to write to. You can optionally define fields to partition by.
You specify the write mode to use: insert, merge, or delete. You can use compound keys for merges and deletes. When merging data, you can specify a distribution key to improve performance. When deleting data, you can specify the action to take for records with no match in the table.
You can optionally have the destination truncate the table before writing to it. You can also have the destination create the table. When creating the table, you specify the Redshift distribution style to use, and you specify the default length and any custom lengths that you want to use for Varchar fields.
You configure security credentials and the database user for the write. When using AWS access keys, you can have the destination automatically create the user. You can also configure advanced properties such as performance-related properties and proxy server properties.
Before using the Amazon Redshift destination, verify whether you need to install a JDBC driver.
StreamSets has tested this destination with Amazon Redshift versions 5.13.0 and 5.29.0.
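Because the destination stages data on Amazon S3 and then writes it to Redshift, the underlying load follows the standard stage-then-COPY pattern for moving S3 data into Redshift. The following is a minimal, hedged sketch of that pattern using psycopg2; the endpoint, table, staging path, IAM role, and file format are hypothetical placeholders, and the destination manages the equivalent steps for you rather than issuing exactly these statements.

```python
# Minimal sketch of the stage-then-load pattern that the destination automates.
# Endpoint, table, staging path, IAM role, and format are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="db_user",
    password="db_password",
)

with conn, conn.cursor() as cur:
    # Load the objects previously staged on S3 into the target table.
    cur.execute("""
        COPY public.sales
        FROM 's3://my-staging-bucket/path/to/objects/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftCopyRole'
        FORMAT AS CSV
    """)
conn.close()
```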
Installing a JDBC Driver
Whether you need to install a JDBC driver depends on the credentials that the destination uses to authenticate with Amazon Redshift:
- Instance profile
- When using an instance profile to authenticate with Redshift, you do not need to install a JDBC driver. The Amazon Redshift destination uses a JDBC driver included with the destination.
- AWS access keys
- When using AWS access keys to authenticate with Redshift, you must install an Amazon Redshift JDBC driver.
Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster.
Credentials and Writing to Redshift
You specify how the destination authenticates with Amazon Redshift. The credentials that you use determine the additional information that you must provide and whether you need to install a JDBC driver. Configure credentials on the Credentials tab of the destination.
The destination can authenticate using the following credentials:
- Instance profile
- When the EC2 nodes in the EMR cluster have an associated instance profile, Transformer uses the instance profile credentials to automatically authenticate with AWS. The IAM policies attached to the instance profile must have permissions to write to Amazon S3 and to the Redshift cluster.
- AWS access keys
- You can authenticate using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use. The AWS access key pair must have permissions to write to Amazon S3 and to the Redshift cluster. When you use AWS access keys, you also specify the following details to enable writing to Redshift:
- DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
- Auto-Create DB User - Enables creating a database user to write data to Redshift.
- DB Groups - Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the S3 staging location.
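For context, the DB User, Auto-Create DB User, and DB Groups options correspond to concepts that also appear in the Redshift temporary-credentials API. The following hedged boto3 sketch shows those parameters; the cluster identifier, database, user, and group names are hypothetical, and this is not necessarily what the destination calls internally.

```python
# Context only: the DB user / auto-create / DB groups concepts as they appear
# in the Redshift temporary-credentials API. Names below are hypothetical.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

creds = redshift.get_cluster_credentials(
    ClusterIdentifier="examplecluster",  # hypothetical cluster
    DbName="dev",
    DbUser="transformer_user",           # database user to impersonate
    AutoCreate=True,                     # create the user if it does not exist
    DbGroups=["etl_writers"],            # existing groups joined for the session
    DurationSeconds=3600,
)
print(creds["DbUser"], creds["Expiration"])
```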
Server-Side Encryption
You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data staged on Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the Base64 encoded 256-bit encryption key.
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
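For context, these options map directly to the server-side encryption parameters of the S3 API. The hedged boto3 sketch below shows how each option is expressed on an S3 write; the bucket, object keys, and key material are hypothetical placeholders, and the destination passes the equivalent values for you.

```python
# Context only: how the three SSE options map to S3 request parameters.
# Bucket, object keys, and key material are hypothetical placeholders.
import os
import boto3

s3 = boto3.client("s3")

# SSE-S3: Amazon S3 manages the encryption keys.
s3.put_object(Bucket="my-staging-bucket", Key="stage/part-0000", Body=b"...",
              ServerSideEncryption="AES256")

# SSE-KMS: specify the ARN of the AWS KMS key to use.
s3.put_object(Bucket="my-staging-bucket", Key="stage/part-0001", Body=b"...",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="arn:aws:kms:us-west-2:111122223333:key/"
                          "1234abcd-12ab-34cd-56ef-1234567890ab")

# SSE-C: provide a 256-bit key with each request (the destination takes the key
# Base64 encoded; boto3 handles the encoding when given raw bytes).
key = os.urandom(32)
s3.put_object(Bucket="my-staging-bucket", Key="stage/part-0002", Body=b"...",
              SSECustomerAlgorithm="AES256", SSECustomerKey=key)
```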
Write Mode
The write mode determines how the Amazon Redshift destination writes to Redshift.
- Insert
- Inserts all data to the table.
- Merge
- Inserts new data and updates existing data in the table. When merging data, you specify one or more primary key columns to uniquely identify each record to be written. If the table includes a stable distribution key, you can also specify the distribution key column, which can improve performance.
- Delete
- Deletes rows with matching key columns. When deleting data, you specify one or more primary key columns to uniquely identify each record to be deleted. You also specify the action to take for records without matching rows in the table: insert the record or drop it. The sketch after this list illustrates the merge and delete semantics.
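Redshift has traditionally handled upserts through a staging table rather than a single statement, so the Merge and Delete modes roughly correspond to SQL along the following lines. This is a hedged sketch of the semantics only, assuming the staged batch has already been loaded into a hypothetical sales_stage table; the table, columns, and connection details are illustrative, and the destination's actual statements may differ.

```python
# Illustration of merge and delete semantics against a hypothetical staging
# table; not the destination's actual implementation.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com",
    port=5439, dbname="dev", user="db_user", password="db_password")

with conn, conn.cursor() as cur:
    # Merge: update rows whose primary key already exists, then insert the rest.
    cur.execute("""
        UPDATE sales
        SET amount = stage.amount
        FROM sales_stage AS stage
        WHERE sales.order_id = stage.order_id
    """)
    cur.execute("""
        INSERT INTO sales
        SELECT stage.* FROM sales_stage AS stage
        LEFT JOIN sales ON sales.order_id = stage.order_id
        WHERE sales.order_id IS NULL
    """)

    # Delete: remove rows whose key columns match the staged records. With the
    # Insert action for unmatched keys, staged records with no matching row
    # would instead be inserted; with Drop, they are discarded.
    cur.execute("""
        DELETE FROM sales
        USING sales_stage AS stage
        WHERE sales.order_id = stage.order_id
    """)
conn.close()
```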
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When staging data on Amazon S3, Spark creates one object for each partition. When you configure the Amazon Redshift destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.
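As a rough illustration of how partitioning translates into staged objects, the hedged PySpark sketch below repartitions a DataFrame by a field and writes it to S3, producing one object per partition. The column name and paths are hypothetical; in a Transformer pipeline, the destination's Partition by Fields property or a Repartition processor controls this for you.

```python
# Rough illustration: each Spark partition becomes one staged S3 object.
# Column name and S3 paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
df = spark.read.json("s3a://my-bucket/input/")

# Repartition by a field so rows with the same value land in the same partition,
# and therefore in the same staged object.
repartitioned = df.repartition("region")

# Writing produces one object per partition under the staging prefix.
repartitioned.write.mode("overwrite").csv("s3a://my-staging-bucket/stage/")
```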
Configuring an Amazon Redshift Destination
Configure an Amazon Redshift destination to write data to an Amazon Redshift table. Before using the destination, verify whether you need to install a JDBC driver.
- In the Properties panel, on the General tab, configure the following properties:
  - Name - Stage name.
  - Description - Optional description.
- On the Amazon S3 tab, configure the following properties:
  - Bucket - S3 location to stage data before writing it to Amazon Redshift. Use the following format: s3a://<bucket name>/<path to objects>/
    The bucket must exist before you start the pipeline.
  - Server-Side Encryption Option - Option that Amazon S3 uses to manage encryption keys for server-side encryption:
    - None - Do not use server-side encryption.
    - SSE-S3 - Use Amazon S3-managed keys.
    - SSE-KMS - Use Amazon Web Services KMS-managed keys.
    - SSE-C - Use customer-provided keys.
    Default is SSE-S3.
  - AWS KMS Key ARN - Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format: arn:<partition>:kms:<region>:<account-id>:key/<key-id>
    For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
    Used for SSE-KMS encryption only.
  - Customer Encryption Key - The 256-bit Base64 encoded encryption key to use. Used for SSE-C encryption only.
  - Delete Data from S3 - Deletes staged data after it is written to Amazon Redshift.
  - Partition by Fields - Fields to use to partition the data.
- On the Redshift tab, configure the following properties:
  - Connection - Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.
  - Redshift Endpoint - Amazon Redshift endpoint to use.
  - Schema - Schema to write to.
  - Table - Amazon Redshift table to write to.
  - Write Mode - Mode to write to Redshift:
    - Insert - Inserts all data to the table.
    - Merge - Inserts new data and updates existing data in the table.
    - Delete - Deletes rows with matching key columns.
  - Primary Keys - Name of one or more key columns in the table. Used to perform updates and deletes. Pipeline records must include a matching field name. Click the Add icon to add additional key columns. The property is case sensitive.
    Available with the Merge and Delete write modes.
  - Distribution Key for Merge - Optional distribution key column. When the table includes a stable distribution key, specifying the column name can improve pipeline performance. If you expect the distribution key column to change, do not specify a distribution key. If you configure this property and also configure the destination to auto-create a table using the Key distribution style, make sure to specify the same column name for both distribution key properties.
    Available with the Merge write mode.
  - Action for Unmatched Keys - Action to take for records without matching rows in the table:
    - Insert - Inserts records that do not have matching rows in the table.
    - Drop - Drops records that do not have matching rows in the table.
    Available with the Delete write mode.
  - Truncate Table - Truncates all rows in the table when the pipeline starts, before writing to the table.
  - Auto-create Table - Creates the specified table in the specified schema. The destination creates the table columns based on the data written to the destination.
  - Distribution Style - Determines how the data is distributed across nodes:
    - Auto - Redshift chooses an optimum distribution style based on the size of the data.
    - Even - Redshift distributes data evenly across all nodes.
    - Key - Redshift distributes data based on the specified distribution key column, placing matching values on the same node slice.
    - All - Redshift writes a copy of the table to all nodes.
    For more information about distribution styles, see the Amazon Redshift documentation.
    Available only when auto-creating the table.
  - Distribution Key - Column to use as the distribution key for the table. If you configure this property and also configure the destination to merge data using a distribution key, make sure to specify the same column name for both distribution key properties.
    Available for the Key distribution style.
  - Default Varchar Column Length - Default length for Varchar columns when automatically creating a table.
  - Custom Varchar Column Lengths - Defines the custom lengths for the specified Varchar columns. Specify the column name and column length that you want to use. Click the Add icon to define additional custom lengths.
- On the Credentials tab, configure the following properties:
  - Security - Mode to use to authenticate with AWS:
    - AWS Keys - Authenticates using an AWS access key pair. Note: When using AWS keys, make sure you install a JDBC driver.
    - Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
  - Access Key ID - AWS access key ID. Required when using AWS keys to authenticate with AWS.
  - Secret Access Key - AWS secret access key. Required when using AWS keys to authenticate with AWS.
  - DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
  - DB Password - Password for the database user account. Available when using Instance Profile security.
  - IAM Role for Unload to S3 - ARN of the IAM role assigned to the Redshift cluster. Transformer uses the role to write to the specified S3 staging location. The role must have write permission for the S3 staging location. Available when using Instance Profile security.
  - Auto-Create DB User - Enables creating a database user to write data to Redshift. Available when using AWS Keys security.
  - DB Groups - Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have write permission for the S3 staging location. Available when using AWS Keys security.
- On the Advanced tab, optionally configure the following properties:
  - Additional Configuration - Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.
    To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.
  - Max Threads - Maximum number of concurrent threads to use for parallel uploads.
  - Buffer Hint - TCP socket buffer size hint, in bytes. Default is 8192.
  - Maximum Connections - Maximum number of connections to Amazon. Default is 1.
  - Connection Timeout - Seconds to wait for a response before closing the connection.
  - Socket Timeout - Seconds to wait for a response to a query.
  - Retry Count - Maximum number of times to retry requests.
  - Use Proxy - Specifies whether to use a proxy to connect.
  - Proxy Host - Proxy host.
  - Proxy Port - Proxy port.
  - Proxy User - User name for proxy credentials.
  - Proxy Password - Password for proxy credentials.
  - Proxy Domain - Optional domain name for the proxy server.
  - Proxy Workstation - Optional workstation for the proxy server.
Connection Timeout Seconds to wait for a response before closing the connection. Socket Timeout Seconds to wait for a response to a query. Retry Count Maximum number of times to retry requests. Use Proxy Specifies whether to use a proxy to connect. Proxy Host Proxy host. Proxy Port Proxy port. Proxy User User name for proxy credentials. Proxy Password Password for proxy credentials. Proxy Domain Optional domain name for the proxy server. Proxy Workstation Optional workstation for the proxy server.