Kinesis Firehose

Supported pipeline types:
  • Data Collector

  • Data Collector Edge

The Kinesis Firehose destination writes data to an Amazon Kinesis Firehose delivery stream. Firehose automatically delivers the data to the Amazon S3 bucket or Amazon Redshift table that you specify in the delivery stream. For information about supported versions, see Supported Systems and Versions.

To write data to Amazon Kinesis Streams, use the Kinesis Producer destination. To write data directly to Amazon S3, use the Amazon S3 destination.

When you use the Kinesis Firehose destination to deliver data to Amazon S3, Firehose can buffer incoming records into larger file sizes before delivering the data to Amazon S3. You configure the buffer size and buffer interval when you create the delivery stream.

When you configure the Kinesis Firehose destination, you specify an existing delivery stream to write to, AWS credentials and region, and the data format to use.

Authentication Method

You can configure the Kinesis Firehose destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys.

For more information about the authentication methods and details on how to configure each method, see Security in Amazon Stages.

Delivery Stream

The Kinesis Firehose destination writes data to an existing delivery stream in Amazon Kinesis Firehose. Before using the Kinesis Firehose destination, use the AWS Management Console to create a delivery stream to an Amazon S3 bucket or Amazon Redshift table.

For more information about creating a Firehose delivery stream, see the Amazon Kinesis Firehose documentation.

Data Formats

The Kinesis Firehose destination writes data to a Kinesis Firehose delivery stream based on the data format that you select.

In Data Collector Edge pipelines, the destination supports only the JSON data format.

The Kinesis Firehose destination processes data formats as follows:

Delimited
The destination writes records as delimited data. When you use this data format, the root field must be list or list-map.
You can use the following delimited format types:
  • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
  • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
  • MS Excel CSV - Microsoft Excel comma-separated file.
  • MySQL CSV - MySQL comma-separated file.
  • Tab-Separated Values - File that includes tab-separated values.
  • PostgreSQL CSV - PostgreSQL comma-separated file.
  • PostgreSQL Text - PostgreSQL text file.
  • Custom - File that uses user-defined delimiter, escape, and quote characters.
  • Multi Character Delimited - File that uses multiple user-defined characters to delimit fields and lines, and single user-defined escape and quote characters.
JSON
The destination writes records as JSON data. Use the multiple objects format, where each file includes multiple JSON objects. Each object is a JSON representation of a record.
Note: The JSON array of objects format is not supported for the Kinesis Firehose destination.

Configuring a Kinesis Firehose Destination

Configure a Kinesis Firehose destination to write data to an Amazon Kinesis Firehose delivery stream.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline. Not valid for cluster pipelines.
  2. On the Kinesis tab, configure the following properties:
    Kinesis Property Description
    Authentication Method Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Data Collector EC2 instance.
    Access Key ID AWS access key ID. Required when using AWS keys to authenticate with AWS.
    Secret Access Key AWS secret access key. Required when using AWS keys to authenticate with AWS.
    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
    Assume Role Temporarily assumes another role to authenticate with AWS.
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in StreamSets user that starts the pipeline or the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    Region AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.
    Endpoint Endpoint to connect to when you select Other for the region. Enter the endpoint name.
    Stream Name Existing delivery stream to write to.

    Use the AWS Management Console to create the delivery stream to an Amazon S3 bucket or Amazon Redshift table.

    Destination Type Type of Amazon destination to write to. Select Existing Stream.
    Maximum Record Size (KB) Maximum size of a single record. When records exceed this size, the destination handles the records based on the error record handling configured for the stage.
    Warning: A Firehose record can have a maximum size of 1,000 KB. If you configure a maximum size larger than 1,000 KB, Firehose does not accept any data written by the destination.
  3. On the Data Format tab, configure the following property:
    Data Format Property Description
    Data Format Data format to use. Use one of the following data formats:
    • Delimited
    • JSON

    In Data Collector Edge pipelines, the destination supports only the JSON data format.

  4. For delimited data, on the Data Format tab, configure the following properties:
    Delimited Property Description
    Delimiter Format Format for delimited data:
    • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
    • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
    • MS Excel CSV - Microsoft Excel comma-separated file.
    • MySQL CSV - MySQL comma-separated file.
    • Tab-Separated Values - File that includes tab-separated values.
    • PostgreSQL CSV - PostgreSQL comma-separated file.
    • PostgreSQL Text - PostgreSQL text file.
    • Custom - File that uses user-defined delimiter, escape, and quote characters.
    Header Line Indicates whether to create a header line.
    Delimiter Character Delimiter character for a custom delimiter format. Select one of the available options or use Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where ​N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Default is the pipe character ( | ).

    Record Separator String Characters to use to separate records. Use any valid Java string literal. For example, when writing to Windows, you might use \r\n to separate records.

    Available when using a custom delimiter format.

    Escape Character Escape character for a custom delimiter format. Select one of the available options or use Other to enter a custom character.

    Default is the backslash character ( \ ).

    Quote Character Quote character for a custom delimiter format. Select one of the available options or use Other to enter a custom character.

    Default is the quotation mark character ( " ).

    Replace New Line Characters Replaces new line characters with the configured string.

    Recommended when writing data as a single line of text.

    New Line Character Replacement String to replace each new line character. For example, enter a space to replace each new line character with a space.

    Leave empty to remove the new line characters.

    Charset Character set to use when writing data.
  5. For JSON data, on the Data Format tab, configure the following property:
    JSON Property Description
    JSON Content Determines how JSON data is written. Select Multiple JSON Objects. Each file includes multiple JSON objects. Each object is a JSON representation of a record.
    Note: The JSON array of objects format is not supported for the Kinesis Firehose destination.
    Charset Character set to use when writing data.