Teradata

Supported pipeline types:
  • Data Collector

The Teradata destination writes data to one or more tables in a Teradata database. For information about supported versions, see Supported Systems and Versions.

You can use the Teradata destination to write new data or change data capture (CDC) data to Teradata. When processing new data, the destination can load data to Teradata using the COPY command. When processing CDC data, the destination uses the MERGE command.

The Teradata destination writes data from record fields to table columns based on matching names. The destination can compensate for data drift by creating new columns and tables in Teradata when new fields and table references appear in records.

The Teradata destination uses a JDBC URL to connect to the Teradata database. When you configure the destination, you specify the JDBC URL and credentials to use to connect to the database.

You configure the Teradata database and tables to use. You specify the load method, error handling, and staging details.

You can optionally enable data drift. You can have the destination create new tables, as needed. You can also specify whether to create all new columns as String instead of inferring the type.

Before you use the Teradata destination, you must complete several prerequisite tasks.

Prerequisites

Before you configure the Teradata destination, complete the following prerequisites:
  1. Install the Teradata driver for the database.
  2. If using external staging, create a Teradata Authorization for the stage. For information on creating a Teradata Authorization, see the Teradata documentation.
  3. To use the COPY command to load new data, complete the COPY prerequisites.
  4. To use the MERGE command to load CDC data, complete the MERGE prerequisites.

Installing the Teradata Driver

Before you use the Teradata destination, install the Teradata driver for the database. You cannot access the database until you install the required driver.

You install the driver into the Teradata stage library, streamsets-datacollector-teradata-lib, which includes the destination.

For information about installing additional drivers, see Install External Libraries.

COPY Prerequisites

When processing new data, you can configure the destination to use the COPY command to load data to Teradata tables.

Using the COPY command to load data requires a role with one of the following sets of access privileges:

  • Required privileges when using a local stage:
    Object Type Privilege
    Table SELECT, INSERT
  • Required privileges when using an external stage:
    Object Type Privilege
    Authorization USAGE
    Table SELECT, INSERT

MERGE Prerequisites

When processing CDC data, you can configure the destination to use the MERGE command to load data to Teradata tables.

Using the MERGE command to load data requires a role with one of the following sets of access privileges:

  • Required privileges when using a local stage:
    Object Type Privilege
    Table SELECT, INSERT, UPDATE, DELETE
  • Required privileges when using an external stage:
    Object Type Privilege
    Authorization USAGE
    Table SELECT, INSERT, UPDATE, DELETE

Load Methods

The Teradata destination loads data differently depending on the configured staging location:
  • When using local staging, the destination uses Teradata FastLoad capabilities to create temporary tables.
  • When using external staging, the destination uses external tables to create temporary tables.

The destination can load data to Teradata tables using the following methods:

COPY command for new data
The COPY command, the default load method, performs a bulk synchronous load to Teradata, treating all records as INSERTS. Use this method to write new data to Teradata tables.
When using the COPY command, the destination first stages the pipeline data in text files in the specified staging location, then copies the staged data to the target Teradata tables.
Since the COPY command is the default load method, you do not need to configure the destination to use this command.
MERGE command for CDC data
Like the COPY command, the MERGE command performs a bulk synchronous load to Teradata. But instead of treating all records as INSERT, it inserts, updates, upserts, and deletes records as appropriate. Use this method to write change data capture (CDC) data to Teradata tables using CRUD operations.
When using the MERGE command, the destination first stages the pipeline data in text files in the specified staging location. Then the destination runs the COPY command to load the data to a temporary Teradata table, and then finally runs a MERGE command that uses the temporary table to merge the changed data into the target Teradata table.
Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data.
To use the MERGE command to load CDC data, select the Processing CDC Data (Use MERGE) property on the Data tab of the destination. Then, configure the Primary Key Location property to indicate the primary key columns to use.

Use the recommended guidelines to optimize for performance and cost-effectiveness.

For more information about the COPY or MERGE commands, see the Teradata documentation.

Primary Key Location

When you configure the Teradata destination to write CDC data using the MERGE command, you must provide primary key information. You can configure the destination to access primary key information in the following ways:

Use JDBC record header attribute
The destination uses primary key columns stored in the jdbc.primaryKeySpecification record header attribute.
Several origins, such as the Oracle CDC Client origin and the PostgreSQL CDC Client origin, populate the record header attribute automatically. You can also use a processor, such as the Expression Evaluator, to populate the header attribute within the pipeline.
Use specified columns for each table
The destination uses the primary key columns that you specify in the Table Key Columns properties.
Note: When key columns are known, entering them manually is more efficient than the using the JDBC record header attribute.

Performance Optimization

Use the following tips to optimize for performance and cost-effectiveness when using the Teradata destination:

Increase the batch size
The maximum batch size is determined by the origin in the pipeline and typically has a default value of 1,000 records. To take advantage of the Teradata loading abilities when writing to Teradata tables using the COPY or MERGE commands, increase the maximum batch size in the pipeline origin to 20,000-50,000 records. Be sure to increase the Data Collector Java heap size, as needed.
Important: Increasing the batch size is strongly recommended. Using the default batch size can be slow and costly.
Use multiple threads
When writing to Teradata tables using the COPY command, you can use multiple threads to improve performance by including a multithreaded origin in the pipeline. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently. As with increasing the batch size, when using multiple threads, you should make sure that the Data Collector Java heap size is sized appropriately.
Note: Do not use multiple threads to write CDC data to Teradata tables with the MERGE command. When using multiple threads to process data, the original order of the data is not retained.
Enable additional connections to Teradata
When writing to multiple Teradata tables using the COPY or MERGE commands, increase the number of connections that the Teradata destination makes to Teradata. Each additional connection allows the destination to write to an additional table, concurrently.
For example, when writing to 10 tables with only one connection, the destination can only write to one table at a time. With 5 connections, the destination can write to 5 tables at a time. 10 connections enables writing to all 10 tables at the same time.

By default, the destination uses one connection for standard single-threaded pipelines. In multithreaded pipelines, the destination matches the number of threads used by the pipeline. That is, when a multithreaded origin is configured to use up to 3 threads, then by default, the Teradata destination uses 3 connections to write to Teradata tables, one for each thread.

Note that the number of connections is for the entire pipeline, not for each thread. So when using multiple threads to write to multiple tables, you can also improve performance by allocating additional connections. For example, when using 3 threads to write to 3 tables, you might increase the number of connections to 9 for maximum throughput.

Use the Connection Pool Size property to specify the maximum number of connections that the destination can use.

Specifying Tables

You can use the Teradata destination to write to one or more tables in a Teradata schema. Specify the tables to use based on how many tables you want to write to:
Single table
To write data to a single table, simply enter table name as follows:
<table_name>
Multiple tables
To write data to multiple tables, specify a field in the record that defines the database and tables.
For example, say you have tables named after departments in your company, such as Operations, Sales, and Marketing. The records being processed have a dept field with matching values. You configure the destination to write records to the various tables using the following expression: ${record:value('/dept')}.
When using the COPY or MERGE command to load data, you can configure the Teradata destination to automatically create tables when a new value appears in the specified field. For example, if the dept field suddenly includes an Engineering department, the destination can create a new Engineering table in Teradata for the new data. For more information, see Enabling Data Drift Handling.
When writing to multiple tables, you might also increase the number of connections that the destination uses for the write. For more information, see Performance Optimization.

Use the Table property on the Teradata tab to specify the tables to write to.

Enabling Data Drift Handling

The Teradata destination can automatically compensate for changes in column or table requirements, also known as data drift.

Note: If you change the destination table schema manually rather than enabling data drift handling, you must restart the pipeline to allow the destination to discover schema changes.
The destination can handle data drift in the following ways:
Create new columns

The destination can create new columns in Teradata tables when new fields appear in records. For example, if a record suddenly includes a new Address2 field, the destination creates a new Address2 column in the target table.

By default, the destination creates new columns based on the data in the new fields, such as creating a Double column for decimal data. You can, however, configure the destination to create all new columns as String.

To enable the automatic creation of new columns, select the Enable Data Drift property on the Teradata tab. To create new columns as String, select the Create New Columns as String property.
Create new tables
The destination can create new tables as needed. For example, say the destination writes data to tables based on the region name in the Region field. When a new SW-3 region shows up in a record, the destination creates a new SW-3 table in Teradata and writes the record to the new table.

You can use this functionality to create all necessary tables in an empty Teradata database schema.

To enable the creation of new tables, select the Auto Create Table property.

Generated Data Types

When creating new tables or creating new columns in existing tables, the Teradata destination uses field names to generate the new column names.

You can configure the destination to create all new columns as String. However, by default, the Teradata destination creates columns as follows:
Record Field Data Type Teradata Column Data Type
Byte Byte
Byte Array Varbyte
Char Character
String Varchar
Decimal Decimal
Long Bigint
Float Float
Integer Integer
Short Smallint
Time Time
Date Date
Datetime Timestamp
Zoned_Datetime Timestamp With Time Zone
Boolean Varchar
Double Float
File_Ref Varchar

The destination does not support nested Data Collector data types: List, List-Map, and Map. By default, the destination treats fields with invalid data types as an error record. You can configure the destination to ignore fields with invalid data types, as described in Missing Fields and Fields with Invalid Types.

Tip: To process fields in a record including nested list, list-map or map fields, use a Field Flattener processor to flatten the entire record to produce a record with no nested fields.

Creating Tables

If you configure the Teradata destination to handle data drift, you can also configure the destination to create tables. Once configured, the destination creates tables when the specified or needed tables do not exist in Teradata.

With the Auto Create Table property enabled on the Teradata tab, the destination creates tables, columns, and primary key columns as follows:
Table
The destination creates a table if the table specified in the Table property does not exist.
While handling data drift, the destination creates tables needed to write records to tables that do not exist.
The destination names new tables based on the table name or expression specified in the Table property.
Table columns
In the created tables, the destination determines the columns from the first batch of data processed. The destination infers the data types from the data in the batch.
The columns in the created tables are nullable, unless they are primary key columns.
Primary key columns
In the created tables, the destination creaties primary key columns. If configured to use the COPY command for new data, the destination uses the primary key information in the jdbc.primaryKeySpecification record header attribute to create primary key columns.

Several origins, such as the Oracle CDC Client origin and the PostgreSQL CDC Client origin, populate the record header attribute automatically. You can also use a processor, such as the Expression Evaluator, to populate the header attribute within the pipeline.

When the jdbc.primaryKeySpecification record header attribute does not exist, the destination creates tables without primary key columns.

The destination does not support primary key changes for existing tables.

Staging Location

The Teradata destination first stages the pipeline data in text files in a specified location. Then, the destination sends the COPY or MERGE command to Teradata to process the staged files.

The destination can stage data in one of the following locations:
Amazon S3
After selecting Amazon S3 as the staging location, specify the existing S3 bucket to stage the files to. You also specify the credentials that the destination uses to connect to Amazon S3.
ADLS Gen2
After selecting ADLS Gen2 as the staging location, specify the name of the existing Azure account and storage container to stage the files to. You then configure the destination to use the appropriate authentication method to connect to Azure Data Lake Storage Gen2.
Azure Blob Storage
After selecting Azure Blob Storage as the staging location, specify the fully-qualified domain name of the Blob Storage account and the name of the storage containter or file system to stage files to. You then configure the destination to use the appropriate authentication method to connect to Azure Blob Storage.
Google Cloud Storage
After selecting Google Cloud Storage as the staging location, specify the project ID and an existing Google Cloud Storage bucket to use. You also define credential provider details.
Local FileSystem
After selecting Local FileSystem as the staging location, specify the local directory to stage the files to and a prefix for the staged files. You can optionally configure a limit on the number of FastLoad sessions for staging. For more information on the Teradata FastLoad utility, see the Teradata documentation.

For all staging locations, you specify the stage file name prefix and whether the destination removes a staged file after its data is written to Teradata tables.

Amazon S3 Credentials

When you configure the destination to connect to an Amazon S3 staging location, the destination must pass credentials to Amazon Web Services. Use one of the following methods to pass AWS credentials:

Instance profile
When Data Collector runs on an Amazon EC2 instance that has an associated instance profile, Data Collector uses the instance profile credentials to automatically authenticate with AWS.
To use an instance profile, you enable the Use Instance Profile property on the Staging tab.
For more information about associating an instance profile with an EC2 instance, see the Amazon EC2 documentation.
AWS access key pair
When Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an instance profile, you can connect with an AWS access key pair. To connect with an AWS access key pair, you specify the access key ID and secret access key on the Staging tab.
Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.

Azure Authentication Information

When you configure the destination to connect to an ADLS Gen2 or Azure Blob Storage staging location, you select the authentication method that the destination uses to connect to Azure Data Lake Storage Gen2 or Azure Blob Storage.

Select one of the following authentication methods:
OAuth with Service Principal
Connections made with OAuth with Service Principal authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
  • Application Key - Authentication key or client secret for the Azure Active Directory application. Also known as the client secret.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

Azure Managed Identity
Connections made with Azure Managed Identity authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

Shared Key
Connections made with Shared Key authentication require the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

SAS Token
Connections made with SAS Token authentication require the following information:
  • Azure SAS Token - Shared access signature (SAS) token that provides secure access to the needed resources in Azure Blob Storage.

    For more information on SAS tokens for storage containers, see the Azure documentation.

SAS Token authentication is available only for Azure Blob Storage staging.

Google Cloud Credentials

Before staging data on Google Cloud Storage, the Teradata destination must pass credentials to Google Cloud.

You can provide credentials using one the following options:
  • Google Cloud default credentials
  • Credentials in a file
  • Credentials in a stage property

For details on how to configure each option, see Security in Google Cloud Stages.

Row Generation

When writing a record to a table, the Teradata destination includes all record fields in the resulting row, by default. The destination uses the root field, /, as the basis for the resulting row.

You can configure the Row Field property to specify a map or list-map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to write to Teradata exists in a single map or list-map field within the record.

If you want to use the root field, but do not want to include all fields in the resulting row, you can configure the destination to ignore all specified first-level fields.

The Teradata destination cannot convert map or list-map fields nested within the specified root field. The destination treats these fields as having an invalid data type. For example, say that you configure the destination to use the root field, /, as the basis for the resulting row. A record contains the following fields:
{
    "name": "Jane Smith",
    "id": "557",
    "address": {
        "street": "101 3rd St",
        "city": "Huntsville",
        "state": "NC",
        "zipcode": "27023"
     }
}

The destination treats the address map field as a field with an invalid data type, processing the field as an error record by default. You can configure the destination to ignore the field and process the remaining record data, as described in Missing Fields and Fields with Invalid Types.

Tip: To write all fields in a record including nested map or list-map fields, use a Field Flattener processor in the pipeline to flatten the entire record to produce a record with no nested fields.

Missing Fields and Fields with Invalid Types

By default, the destination treats records with missing fields or with invalid data types in fields as error records. You can configure the destination to ignore missing fields or fields with invalid data types, replacing the data in the field with an empty value.

You can specify a default value to use for each data type on the Data Advanced tab. For example, you might define the default value for a missing String field or a String field with an invalid data type as none or not_applicable.

To configure the destination to ignore missing fields and fields with invalid data types, select the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data Advanced tab.

CRUD Operation Processing

The Teradata destination can insert, update, upsert, or delete data when you configure the destination to process CDC data. When processing CDC data, the destination uses the MERGE command to write data to Teradata.

When writing data, the Teradata destination uses the CRUD operation specified in the sdc.operation.type record header attribute. The destination performs operations based on the following numeric values:
  • 1 for INSERT
  • 2 for DELETE
  • 3 for UPDATE
  • 4 for UPSERT

If your pipeline has a CRUD-enabled origin that processes changed data, the destination simply reads the operation type from the sdc.operation.type header attribute that the origin generates. If your pipeline has a non-CDC origin, you can use the Expression Evaluator processor or a scripting processor to define the record header attribute. For more information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.

Configuring a Teradata Destination

Configure a Teradata destination to write data to Teradata tables.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error
    Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Teradata tab, configure the following properties:
    Teradata Property Description
    JDBC URL JDBC URL used to connect to the Teradata cluster.

    Enter in the following format: jdbc:teradata://<server_hostname>

    User Teradata user name.
    Password Teradata password.
    Database Teradata database.
    Table Teradata tables to write to. To write to a single table, enter the table name.

    To write to multiple tables, enter an expression that evaluates to the field in the record that contains the table name.

    For example: ${record:value('/table')}

    Or, to write to tables based on the table name in the jdbc.table record header attribute defined by the JDBC Multitable Consumer origin, you can use the following expression: ${record:attribute('jdbc.tables')}

    Enable Data Drift Enables the destination to create new columns in Teradata tables when new fields appear in records.
    Auto Create Table Automatically creates tables when needed.

    Available when data drift is enabled.

    Create New Columns as String Creates all new columns as String. By default, the destination creates new columns based on the type of data in the field.

    Available when data drift is enabled.

    Maximum Connection Threads Maximum number of connections that the destination uses to write to Teradata. The default, 0, ensures that the destination uses the same number of connections as threads used by the pipeline.

    When writing to multiple tables using the COPY or MERGE command, increasing this property can improve performance.

    Note: Must be set to 0 when processing CDC data.
    Connection Properties Additional Teradata connection properties to use.

    To add properties, click Add and define the key and value. Use the keys and values as expected by Teradata.

    Queries per Second Maximum number of queries to run in a second across all partitions and tables. Use 0 for no limit.

    Default is 0.

    Query Maximum Wait Time (secs) Maximum number of seconds to wait for a query to finish before timing out. Use 0 for no limit.

    Default is 0.

    Maximum Number of Retries Maximum number of times to retry a query.

    Default is 1.

    Retry Interval (secs) Time between retries in seconds.

    Default is 10.

  3. On the Staging tab, configure the following properties:
    Staging Property Description
    Staging Location Location of the Teradata stage:
    • Amazon S3
    • ADLS Gen2
    • Azure Blob Storage
    • Google Cloud Storage
    • Local FileSystem

    This property configuration determines the properties that display on this tab and the Staging Advanced tab.

    Authorization Database Database of the Teradata Authorization to use to access the staging location.

    Required when using any staging location other than Local FileSystem.

    Authorization Name Name of the Teradata Authorization to use to access the staging location.

    Required when using any staging location other than Local FileSystem.

    Amazon S3 Authentication Method Authentication method to use when staging data.
    • AWS Keys - Use AWS access keys.
    • Instance Profile - Enables using an instance profile to write to an external stage on Amazon S3. Use only when Data Collector runs on an Amazon EC2 instance.
    • None

    Required when using the Amazon S3 staging location.

    Access Key ID

    AWS access key ID.

    Required when using AWS access key authentication to connect to Amazon S3.

    Secret Access Key

    AWS secret access key.

    Required when using AWS access key authentication to connect to Amazon S3.
    Bucket Name of the bucket to stage the files to.

    Required when staging data on Amazon S3 or Google Cloud Storage.

    Use Specific Region Specify the AWS region or endpoint to connect to.

    When cleared, the stage uses the Amazon S3 default global endpoint, s3.amazonaws.com.

    Account FQDN Fully qualified domain name of the Azure storage account. The destination can access Azure Data Lake Storage Gen2 or Blob storage accounts in either global Azure or Azure Government services.

    For example, valid domain names for global Azure services include <account-name>.dfs.core.windows.net or <account-name>.blob.core.windows.net

    Valid domain names for Azure Government services include <account-name>.dfs.core.usgovcloudapi.net or <account-name>.blob.core.usgovcloudapi.net

    Available when using the ADLS Gen2 or Azure Blob Storage staging location.

    Storage Container / File System Name of the existing Azure storage container or file system to stage the files to.

    Available when using the ADLS Gen2 or Azure Blob Storage staging location.

    Azure Authentication Method Authentication method used to connect to Azure:
    • OAuth with Service Principal
    • Azure Managed Identity
    • Shared Key
    • SAS Token - Azure Blob Storage only

    Available when using the ADLS Gen2 or Azure Blob Storage staging location.

    Application ID Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

    Available when using the OAuth with Service Principal or the Azure Managed Identity authentication method.

    Endpoint Type Method to provide endpoint details.

    Available when using the OAuth with Service Principal authentication method.

    Tenant ID Tenant ID for the Azure Active Directory Data Collector application. Also known as the directory ID.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Available when Endpoint Type is set to Tenant ID.

    Endpoint URL Endpoint URL for the Azure Active Directory Data Collector application.

    Default is https://login.microsoftonline.com/<tenant-id>/oauth2/token.

    In the URL, specify the tenant ID for the Azure Active Directory Data Collector application.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Available when Endpoint Type is set to Endpoint URL.

    Application Key Authentication key or client secret for the Azure Active Directory application. Also known as the client secret.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Available when using the OAuth with Service Principal authentication method.

    Account Shared Key Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

    Available when using the Shared Key authentication method for Azure.

    Azure SAS Token Azure SAS Token.

    To secure sensitive information, you can use runtime resources or credential stores.

    Available when using SAS Token authentication for Azure Blob Storage.

    Use Custom Blob Service URL Enables using a custom Azure Blob Storage URL.

    Available when using the Azure Blob Storage storage location.

    Custom Blob Service URL Custom Azure Blob Storage URL. Typically uses the following format:
    https://<Azure Account>.blob.core.windows.net
    Project ID

    Google Cloud project ID to use.

    Available when using the Google Cloud Storage staging location.

    Credentials Provider Provider for Google Cloud credentials:
    • Default credentials provider - Uses Google Cloud default credentials.
    • Service account credentials file (JSON) - Uses credentials stored in a JSON service account credentials file.
    • Service account credentials (JSON) - Uses JSON-formatted credentials information from a service account credentials file.

    Available when using the Google Cloud Storage staging location.

    Credentials File Path (JSON) Path to the Google Cloud service account credentials file used to connect. The credentials file must be a JSON file.

    Enter a path relative to the Data Collector resources directory, $SDC_RESOURCES, or enter an absolute path.

    Available when using the Google Cloud Storage staging location.

    Credentials File Content (JSON) Contents of a Google Cloud service account credentials JSON file used to connect.

    Enter JSON-formatted credential information in plain text, or use an expression to call the information from runtime resources or a credential store.

    Available when using the Google Cloud Storage staging location.

    FastLoad Sessions Number of sessions to configure for Teradata FastLoad CSV files. Enter 0 for no limit. For more information on the Teradata FastLoad utility, see the Teradata documentation.

    Default is 0.

    Available when using the Local FileSystem staging location.

    Files Directory A directory local to Data Collector on which to stage files. Enter an absolute path.

    Available when using the Local FileSystem staging location.

    Files Prefix Prefix to use for output files. Use when writing to a directory that receives files from other sources.

    Uses the prefix sdc-${sdc:id()} by default. The prefix evaluates to sdc-<Data Collector ID>.

    The Data Collector ID is stored in the following file: $SDC_DATA/sdc.id. For more information about environment variables, see Data Collector Environment Configuration.

    Stage File Name Prefix Optional prefix for the external stage name.
    Purge Stage File After Ingesting Removes a stage file after its data is written to Teradata.
    Compressed File Compresses files before writing them to the staging location. Keep this option enabled for optimum performance.
  4. When using an Amazon S3 staging location, on the Staging Advanced tab, configure the following properties:
    Amazon S3 Advanced Property Description
    Connection Timeout Seconds to wait for a response before closing the connection.
    Socket Timeout Seconds to wait for a response to a query.
    Retry Count Maximum number of times to retry requests.
    Use Proxy Specifies whether to use a proxy to connect.
    Proxy Host Proxy host.
    Proxy Port Proxy port.
    Proxy User Proxy user.
    Proxy Password Proxy password.
    Proxy Domain Optional domain name for the proxy server.
    Proxy Workstation Optional workstation for the proxy server.
    Uploading Threads Size of the thread pool for parallel uploads. Used when working with multiple partitions and processing large objects in multiple parts.

    When working with multiple partitions, setting this property up to the number of partitions being used to can improve performance.

    For more information about this and the following properties, see the Amazon S3 TransferManager documentation.

    Minimum Upload Part Size (MB) Minimum part size in bytes for multipart uploads.
    Multipart Upload Threshold (MB) Minimum batch size in bytes for multipart uploads.
    Encryption Option that Amazon S3 uses to manage the encryption keys:
    • None
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.

    Default is None.

    Encryption KMS ID Amazon resource name (ARN) of the AWS KMS master encryption key. Use the following format:
    <arn>:<aws>:<kms>:<region>:<acct ID>:<key>/<key ID>

    Used for SSE-KMS encryption only.

    Encryption Context Key-value pairs to use for the encryption context. Click Add to add key-value pairs.

    Used for SSE-KMS encryption only.

    Tags List of Amazon S3 tags to add to the staged objects. You can use object tagging to categorize storage and maintain file metadata such as Data Classification.

    Each tag is a key-value pair.

    For more information, see the Amazon S3 documentation.

  5. On the Data tab, configure the following properties:
    Data Property Description
    Staging File Format Format for staging files. Select one of the following options:
    Row Field Map or list-map field to use as the basis for the generated row. Default is /, which includes all record fields in the resulting row.
    Column Fields to Ignore A list of fields to ignore when writing to the destination. You can enter a comma-separated list of first-level fields to ignore.
    Merge CDC Data Enables performing CRUD operations and using the MERGE command to write to Teradata tables. Select to process CDC data.
    Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data.

    For more information about the MERGE command and other load methods, see Load Methods. For information about optimizing pipeline performance, see Performance Optimization.

    Primary Key Location Location of primary key details:
    • JDBC Header - Use when primary key columns are defined in the jdbc.primaryKeySpecification record header attribute.
    • Specify for Each Table - Use to list each target table and the primary key columns for the table. This is the most efficient option when key columns are known.

    Available when using the MERGE command to process CDC data.

    Table Key Columns Primary key columns to use for each Teradata table. Click the Add icon to define primary key columns for additional tables.

    Click the Add icon in the Key Column field to add additional key columns for a table.

    Available when Primary Key Location is set to Specify for Each Table.

  6. For Parquet staging file format, on the Data tab, configure the following properties:
    Parquet Property Description
    Parquet Schema Location Location to load the Parquet schema from:
    • Parquet Schema in Record Header - Uses the schema in the parquetSchema record header attribute.
    • In Pipeline Configuration - Uses the schema you provide in the stage configuration.
    • Infer from records - The destination infers the schema from the first record of the first batch of data.
    Nullable Fields Allow null values in schema fields.

    Default is enabled.

    Parquet Compression Codec Compression type for output files:
    • Uncompressed
    • GZIP
    • LZO
    • Snappy
    Parquet Format Version Parquet file format version to use.
    Parquet Row Group Size (bytes) Maximum size in bytes of a row group being buffered in memory.

    The size of a row group is defined by the batch size, but you can limit it to restrict the memory usage for the write.

    Parquet Page Size (bytes) Page size in bytes for compressed pages.
    Parquet Max Padding Size (bytes) Maximum size in bytes of allowed padding to align rows.
    Enable Parquet Dictionary Encoding Enables Parquet dictionary encoding.

    There is one dictionary page per column per group when dictionary encoding is enabled.

    Parquet Dictionary Page Size (bytes) Page size in bytes for dictionary pages.
  7. On the Data Advanced tab, configure the following properties:
    Data Advanced Property Description
    Ignore Missing Fields Allows writing records with missing fields to tables. Uses the specified default value for the data type of the missing field.

    When not enabled, records with missing fields are treated as error records.

    Ignore Fields with Invalid Types Allows replacing fields that contain data of an invalid type with the specified default value for the data type.

    When not enabled, records with data of invalid types are treated as error records.

    Numeric Types Default Default value to use when replacing missing fields or fields with invalid data that use one of the numeric data types. Numeric data types include Smallint, Integer, and Bigint.
    Float Default Default value to use when replacing missing Float fields or Float fields with invalid data.
    Decimal Default Default value to use when replacing missing Decimal fields or Decimal fields with invalid data.
    Time Default Default value to use when replacing missing Time fields or Time fields with invalid data.
    Date Default Default value to use when replacing missing Date fields or Date fields with invalid data.
    Timestamp Default Default value to use when replacing missing Timestamp fields or Timestamp fields with invalid data.
    String Default Default value to use when replacing missing String fields or String fields with invalid data.
    Character Default Default value to use when replacing missing Character fields or Character fields with invalid data.
    Binary Default Default value to use when replacing missing Binary fields or Binary fields with invalid data.
    Interval Default Default value to use when replacing missing Interval fields or Interval fields with invalid data.
    Replace Newlines Replaces newline characters in string fields with the specified replacement character.
    Newline Replacement Character Character to use to replace newline characters.
    Trim Spaces Trims leading and trailing spaces from field data.