HBase

Supported pipeline types:
  • Data Collector

The HBase destination writes data to an HBase cluster. The destination can write data to an existing HBase table as text, binary data, or JSON strings. You can define the data format for each column written to HBase. For information about supported versions, see Supported Systems and Versions.

When you configure the HBase destination, you specify the HBase configuration properties, including the ZooKeeper Quorum, parent znode, and the existing table name. You specify the row key for the table, and then map fields from the pipeline to HBase columns.

When necessary, you can enable Kerberos authentication and specify an HBase user. You can also configure a time basis and add HBase configuration properties.
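The three storage formats mentioned above (text, binary data, and JSON strings) can be pictured with a minimal Python sketch. This is not the destination's implementation: the function name `encode_cell`, the lowercase storage type labels, and the rule that binary storage accepts only raw bytes are assumptions made for illustration.

```python
import json


def encode_cell(value, storage_type):
    """Encode one field value as bytes for an HBase cell (illustrative only)."""
    if storage_type == "text":
        # Text: the string form of the value, encoded as UTF-8.
        return str(value).encode("utf-8")
    if storage_type == "json":
        # JSON string: serialize maps and lists to compact JSON.
        return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")
    if storage_type == "binary":
        # Binary: pass raw bytes through unchanged (assumption: no numeric packing).
        if isinstance(value, (bytes, bytearray)):
            return bytes(value)
        raise TypeError("binary storage expects a bytes value")
    raise ValueError(f"unknown storage type: {storage_type!r}")


print(encode_cell(42, "text"))
print(encode_cell({"b": 1, "a": [1, 2]}, "json"))
print(encode_cell(b"\x00\x01", "binary"))
```

Because HBase stores every cell as a byte array, the storage type only decides how a field value is turned into bytes before the write.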

Field Mappings

When you configure the HBase destination, you map fields from records to HBase columns.

You can map fields to columns in the following ways:

Explicit field mappings
By default, the HBase destination uses explicit field mappings. You select the fields from records to map to HBase columns. Specify the HBase columns using the following format: <column-family>:<qualifier>. You then define the storage type for the column in HBase.
When you use explicit field mappings, you can configure the destination to ignore missing field paths. If the destination encounters a mapped field path that doesn’t exist in the record, the destination ignores the missing field path and writes the remaining fields in the record to HBase.
Implicit field mappings
When you configure the HBase destination to use implicit field mappings, the destination writes data based on the matching field names. You can use implicit field mappings when the field paths use the following format:
<column-family>:<qualifier>
For example, if a field path is "cf:a", the destination can implicitly map the field to the HBase table with the column family "cf" and the qualifier "a".
When you use implicit field mappings, you can configure the destination to ignore invalid columns. If the destination encounters a field path that cannot be mapped to a valid HBase column, the destination ignores the invalid column and writes the remaining fields in the record to HBase.
Both implicit and explicit field mappings
You can configure the destination to use implicit field mappings, and then override those mappings by defining explicit mappings for specific fields.
For example, a record might contain some field paths that use the <column-family>:<qualifier> format and other field paths that don’t use the required format. You can add explicit field mappings for the field paths that do not use the required format. Or, you can add explicit field mappings for fields that use the required format but must be written to a different column.
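The mapping rules described in this section can be sketched in Python as follows. The sketch is a simplified model, not the destination's code: records are plain dictionaries keyed by field path, and the names `parse_column`, `resolve_columns`, `ignore_missing`, and `ignore_invalid` are hypothetical stand-ins for the Ignore Missing Field and Ignore Invalid Column options.

```python
def parse_column(name):
    """Split "<column-family>:<qualifier>" into a tuple, or return None if invalid."""
    family, sep, qualifier = name.partition(":")
    if not sep or not family or not qualifier:
        return None
    return (family, qualifier)


def resolve_columns(record, explicit=None, implicit=False,
                    ignore_missing=False, ignore_invalid=False):
    """Return {(family, qualifier): value} for one record.

    record:   dict of field path -> value
    explicit: dict of field path -> "<column-family>:<qualifier>"
    """
    explicit = explicit or {}
    columns = {}

    if implicit:
        for path, value in record.items():
            if path in explicit:
                continue  # an explicit mapping overrides the implicit one
            column = parse_column(path)
            if column is None:
                if ignore_invalid:
                    continue  # skip the field, keep writing the rest
                raise ValueError(f"cannot map field path to a column: {path}")
            columns[column] = value

    for path, target in explicit.items():
        if path not in record:
            if ignore_missing:
                continue  # skip the missing field, keep writing the rest
            raise KeyError(f"mapped field path not in record: {path}")
        column = parse_column(target)
        if column is None:
            raise ValueError(f"invalid column name: {target}")
        columns[column] = record[path]

    return columns


record = {"cf:a": 1, "cf:b": 2, "name": "x"}
explicit = {"name": "info:name", "cf:b": "cf:renamed"}
print(resolve_columns(record, explicit=explicit, implicit=True))
```

In this example, "cf:a" is mapped implicitly, "name" is mapped explicitly because it does not use the required format, and "cf:b" is redirected to a different column by an explicit mapping.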

Kerberos Authentication

You can use Kerberos authentication to connect to HBase. When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to HBase. By default, Data Collector connects using the user account that started it.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file.

For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication.

Using an HBase User

Data Collector can either use the currently logged in Data Collector user or a user configured in the destination to write to HBase.

You can set a Data Collector configuration property that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the destination. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode.

Note that the destination uses a different user account to connect to HBase. By default, Data Collector uses the user account that started it to connect to external systems. When using Kerberos, Data Collector uses the Kerberos principal.

To configure a user in the destination to write to HBase, perform the following tasks:
  1. On HBase, configure the user as a proxy user and authorize the user to impersonate the HBase user.

    For more information, see the HBase documentation.

  2. In the HBase destination, enter the HBase user name.

Time Basis

The time basis determines the timestamp value added for each column written to HBase.

You can use the following times as the time basis:

Processing Time
When you use processing time as the time basis, the destination uses the Data Collector processing time as the timestamp value. The processing time is calculated once per batch.
To use the processing time as the time basis, use the following expression: ${time:now()}.
Record Time
When you use the time associated with a record as the time basis, you specify a Date or Datetime field in the record. The destination uses the field value as the timestamp value.
To use a time associated with the record, use an expression that calls a field and resolves to a date or datetime value, such as ${record:value("/Timestamp")}.
System Time
When you leave the Time Basis field empty, the destination uses the timestamp value automatically generated by HBase when the column is written to HBase.
This is the default time basis.
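The three time basis options can be sketched in Python as follows. HBase cell timestamps are expressed in milliseconds since the epoch. Everything else here is an assumption for illustration: the function name `cell_timestamp_ms`, the "processing" and "record" labels, and the use of timezone-aware datetime values. The sketch does not evaluate the expression language shown above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cell_timestamp_ms(basis, batch_time, record=None, field=None):
    """Return the cell timestamp in epoch milliseconds, or None for system time.

    basis:      "processing", "record", or None (system time)
    batch_time: timezone-aware datetime captured once per batch
    record:     dict of field path -> value, used when basis is "record"
    field:      path of the Date or Datetime field, used when basis is "record"
    """
    if basis is None:
        return None  # HBase assigns the timestamp when the column is written
    if basis == "processing":
        moment = batch_time  # the same value for every record in the batch
    elif basis == "record":
        moment = record[field]
    else:
        raise ValueError(f"unknown time basis: {basis!r}")
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


batch_time = datetime.now(timezone.utc)
record = {"/Timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)}
print(cell_timestamp_ms("processing", batch_time))
print(cell_timestamp_ms("record", batch_time, record, "/Timestamp"))
print(cell_timestamp_ms(None, batch_time))
```

With processing time, every column written in a batch carries the same timestamp. With record time, the timestamp follows the data, which matters when records arrive late or out of order.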

HBase Properties and Configuration File

You can configure the HBase destination to use individual HBase properties or an HBase configuration file:

HBase configuration file
You can use the following HBase configuration file with the HBase destination:
  • hbase-site.xml
To use HBase configuration files:
  1. Store the files or a symlink to the files in the Data Collector resources directory.
  2. In the HBase destination, specify the location of the files.
    Note: For a Cloudera Manager installation, Data Collector automatically creates a symlink to the files named hbase-conf. Enter hbase-conf for the location of the files in the HBase destination.
Individual properties
You can configure individual HBase properties in the HBase destination. To add an HBase property, you specify the exact property name and the value. The HBase destination does not validate the property names or values.
Note: Individual properties override properties defined in the HBase configuration file.
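The override rule in the note above can be sketched in Python as follows. The sketch assumes the usual hbase-site.xml layout of property elements with name and value children. The function names and the sample values are hypothetical, and the destination's actual loading logic may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical hbase-site.xml content used only for this example.
SITE_XML = """<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com</value>
  </property>
  <property>
    <name>hbase.rpc.timeout</name>
    <value>60000</value>
  </property>
</configuration>"""


def load_site_properties(xml_text):
    """Read name/value pairs from hbase-site.xml content."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


def effective_properties(xml_text, stage_properties):
    """Merge file properties with stage properties; stage properties win."""
    merged = load_site_properties(xml_text)
    merged.update(stage_properties)
    return merged


# Individual properties as they might be entered in the stage (not validated).
stage_properties = {"hbase.rpc.timeout": "30000"}
for name, value in sorted(effective_properties(SITE_XML, stage_properties).items()):
    print(name, "=", value)
```

Here hbase.rpc.timeout takes the value defined in the stage, while hbase.zookeeper.quorum still comes from the configuration file.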

Configuring an HBase Destination

Configure an HBase destination to write data to HBase.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
    Stage Library - Library version that you want to use.
    Required Fields - Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error - Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the HBase tab, configure the following properties:
    HBase Property - Description
    ZooKeeper Quorum - Comma-separated list of servers in the ZooKeeper quorum. Use the following format for each server:
    <host>.<domain>.com

    To ensure a connection, enter more than one server in the quorum.

    ZooKeeper Client Port - Port number that clients use to connect to the ZooKeeper servers.
    ZooKeeper Parent Znode - Root node that contains all znodes used by the HBase cluster.
    Table Name - Name of the HBase table to use. Enter a table name or a namespace and table name as follows: <namespace>.<tablename>.

    If you do not enter a namespace, HBase uses the default namespace.

    Row Key - Field in the record that acts as the row key in the HBase table.
    Storage Type - Storage type of the row key.
    Fields - Explicitly map fields from records to HBase columns, and then define the storage type for the column in HBase.

    Using simple or bulk edit mode, click the Add icon to create additional explicit field mappings.

    Ignore Missing Field - Ignores missing field paths. Used when you define explicit field mappings.

    If selected and the destination encounters a mapped field path that doesn’t exist in the record, the destination ignores the missing field path and writes the remaining fields in the record to HBase. If cleared and the destination encounters a mapped field path that doesn’t exist in the record, the record is sent to the stage for error handling.

    Ignore Invalid Column - Ignores invalid columns. Used when you configure implicit field mappings.

    If selected and the destination encounters a field path that cannot be mapped to a valid HBase column, the destination ignores the invalid column and writes the remaining fields in the record to HBase. If cleared and the destination encounters an invalid column, the record is sent to the stage for error handling.

    Implicit Field Mapping - Uses implicit field mappings so that the destination writes data to HBase columns based on the matching field names. The field paths must use the following format:
    <column-family>:<qualifier>
    Kerberos Authentication - Uses Kerberos credentials to connect to HBase.

    When selected, uses the Kerberos principal and keytab defined in the Data Collector configuration file, $SDC_CONF/sdc.properties.

    Validate Table Existence - Validates that the HBase table exists before writing to the table.

    When selected, the destination obtains a table descriptor to determine if the table exists and to validate the column family. This requires that the HBase user has HBase administrator rights.

    You might want to clear this property when you do not want to grant HBase administrator rights to the HBase user. If you configure the destination to skip validation and a table does not exist, then the pipeline encounters an error.

    HBase User - The HBase user to use to write to HBase. When using this property, make sure HBase is configured appropriately.

    When not configured, the pipeline uses the currently logged in Data Collector user.

    Not configurable when Data Collector is configured to use the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode.

    Time Basis - Time basis to use for the timestamp value added to each column written to HBase. Use one of the following expressions:
    • ${time:now()} - Uses the Data Collector processing time as the time basis.
    • ${record:value(<date field path>)} - Uses the time associated with the record as the time basis.

    Or, leave empty to use the system time automatically generated by HBase as the time basis.

    HBase Configuration Directory - Location of the HBase configuration files.

    For a Cloudera Manager installation, enter hbase-conf. For all other installations, use a directory or symlink within the Data Collector resources directory.

    You can use the following file with HBase:
    • hbase-site.xml
    Note: Properties in the configuration files are overridden by individual properties defined in the stage.
    HBase Configuration - Additional HBase configuration properties to use.

    To add properties, click Add and define the property name and value. Use the property names and values as expected by HBase.
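Before entering the ZooKeeper Quorum and Table Name values, it can help to check that they follow the formats described above. The following Python sketch shows one way to do that. The helper names are hypothetical, the namespace is assumed to end at the first period, and the fallback namespace is assumed to be the HBase namespace named default.

```python
def parse_quorum(quorum):
    """Split a comma-separated ZooKeeper quorum into a list of host names."""
    hosts = [host.strip() for host in quorum.split(",")]
    if not all(hosts):
        raise ValueError("quorum contains an empty server entry")
    return hosts


def parse_table_name(name, default_namespace="default"):
    """Split "<namespace>.<tablename>" into (namespace, table).

    Assumption: the namespace ends at the first period. A name without a
    period is placed in the default namespace.
    """
    name = name.strip()
    namespace, sep, table = name.partition(".")
    if not sep:
        namespace, table = default_namespace, name
    if not namespace or not table:
        raise ValueError(f"invalid table name: {name!r}")
    return namespace, table


print(parse_quorum("zk1.example.com, zk2.example.com, zk3.example.com"))
print(parse_table_name("sales.orders"))
print(parse_table_name("orders"))
```

Listing several quorum servers gives the client alternatives when one server is unavailable, which is the reason for entering more than one.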