Hive Metastore

The Hive Metastore destination works with the Hive Metadata processor and the Hadoop FS or MapR FS destination as part of the Drift Synchronization Solution for Hive. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

The Hive Metastore destination uses metadata records generated by the Hive Metadata processor to create and update Hive tables. This enables the Hadoop FS and MapR FS destinations to write drifting Avro or Parquet data to HDFS or MapR FS.
Note: Starting with Cloudera CDP 7.1.4, Hive-managed internal tables can include ORC data only. As a result, when using Cloudera CDP 7.1.4 or later, the Hive Metastore destination cannot create or update managed internal tables. The destination can create or update external tables only.

The Hive Metastore destination compares information in metadata records with Hive tables, and then creates or updates the tables as needed. For example, when the Hive Metadata processor encounters a record that requires a new Hive table, it passes a metadata record to the Hive Metastore destination and the destination creates the table.

Hive table names, column names, and partition names are created with lowercase letters. Names that include uppercase letters become lowercase in Hive.

Note that the Hive Metastore destination does not process data. It processes only metadata records generated by the Hive Metadata processor and must be downstream from the processor's metadata output stream.

When you configure Hive Metastore, you define the connection information for Hive, the location of the Hive and Hadoop configuration files and optionally specify additional required properties. You can also use a connection to configure the destination.

You can enable Kerberos authentication. You can also set a maximum cache size for the destination, determine how new tables are created and stored, and configure custom record header attributes.

The destination can also generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.

Important: When using the destination in multiple pipelines, take care to avoid concurrent or conflicting writes to the same tables.

For more information about the Drift Synchronization Solution for Hive and case studies for processing Avro and Parquet data, see Drift Synchronization Solution for Hive. For a tutorial, check out our tutorial on GitHub.

Metadata Processing

When processing records, Hive Metastore destination performs the following tasks:

Creates or updates Hive tables as needed
  1. For each metadata record that includes a request to create or update a table, the destination checks Hive for the table.
  2. If the table in the metadata record does not exist or differs from the Hive table, the destination creates or updates the table as needed.

    Hive table names, column names, and partition names are created with lowercase letters. For more information, see Hive Names and Supported Characters.

Note: The destination can create tables and partitions. It can add columns to tables and ignore existing columns. It does not drop existing columns from tables.
Creates new Avro schemas as needed
When the Drift Synchronization Solution for Hive processes Avro data, you can configure the Hive Metastore destination to generate Avro schemas. Under these conditions, the Hive Metastore performs the following tasks:
  1. For each metadata record that includes a schema change, the destination checks Hive for the current set of columns for the specified table.
  2. When there are compatible differences, the destination generates a new Avro schema that incorporates the differences.

    This can occur when a separate entity changes the target table in the moments between evaluation by the Hive Metadata processor and the Hive Metastore destination.

Hive Table Generation

When the Drift Synchronization Solution for Hive processes Parquet data, the destination uses the Stored as Parquet clause when generating the table so it does not need to generate a new schema for each change.

When the Drift Synchronization Solution for Hive processes Avro data, the Hive Metastore destination can generate Hive tables using the following methods:
With the Stored As Avro clause
Generates the table with a query that includes the Stored As Avro clause. When using the Stored As Avro clause, the destination does not need to generate an Avro schema for each change in the Hive table.
This is the default and recommended method for table generation. Enable the Stored As Avro property to use this method.
Without the Stored As Avro clause
Generates the table without a Stored As Avro clause in the query. Instead, the destination generates an Avro schema for each Hive table update. The destination uses the following format for the schema name: avro_schema_<database>_<table>_<UUID>.avsc.
The destination stores the Avro schema in HDFS. You can configure where the destination saves the schemas. You can specify a full path or a path relative to the table directory. By default, the destination saves the schema in a .schemas subfolder of the table directory.
You can configure the destination to generate and store the schemas as a specified HDFS user. Data Collector must be configured as a proxy user in HDFS.

Cache

The Hive Metastore destination queries Hive for information and caches the results. When possible, it uses the cache to avoid unnecessary Hive queries.

The destination caches the following Hive metadata:
  • Database and table to be written to
  • Hive table properties
  • Column names and types in the table
  • Partition values

Cache Size and Evictions

You can configure the maximum size of the cache. When the cache reaches the specified limit, it uses the LRU eviction policy, which removes the least recently used data to allow for new entries to be written to the cache.

Event Generation

The Hive Metastore destination can generate events that you can use in an event stream. When you enable event generation, the destination creates event records each time it updates the Hive metastore, including when it creates a table, adds columns, or creates a partition. It also creates events when it generates and writes a new Avro schema file to the destination system.
Note: Updates to existing tables, columns and partitions are treated as creates since the destination does not change or delete existing structures.
Hive Metastore events can be used in any logical way. For example:

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Hive Metastore event records include the following event-related record header attributes. Record header attributes are stored as String values:

Record Header Attribute Description
sdc.event.type Event type. Uses one of the following types:
  • new-table - Generated when the destination creates a new table.
  • new-columns - Generated when the destination creates new columns.
  • new partition - Generated when the destination creates a new partition.
  • avro-schema-store - Generated when the destination generates and writes a new Avro schema file to the destination system.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
Hive Metastore can generate the following types of event records:
New table

The destination generates a new table event record when it creates a new table.

New table event records have the sdc.event.type record header attribute set to new-table and include the following field:
Field Description
table Fully qualified table name using the following format: '<DB>'.'<table>'.
columns A list-map field with the following information for each new column:
  • Column name
  • Hive data type, precision and scale
For example:
    • key: `id`
    • value: INT
    • key: `desc`
    • value: STRING
partitions A list-map field with the following information:
  • Partition name
  • Partition value
For example:
    • name: `dt`
    • value: 2017-01-01
New column

The destination generates a new column event record when it creates new columns in a table.

New column event records have the sdc.event.type record header attribute set to new-column and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
columns A list-map field with the following information for each new column:
  • Column name
  • Hive data type, precision and scale
New partition

The destination generates a new partition event record when it creates a new partition in a table.

New partition event records have the sdc.event.type record header attribute set to new-partition and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
partitions A list-map field with the following information:
  • Partition name
  • Partition value
New Avro schema file
When the Drift Synchronization Solution for Hive processes Avro data and the Stored as Avro option in the destination is not enabled, the destination generates and writes an Avro schema file event each time it creates or updates a table.
New Avro schema files have the sdc.event.type record header attribute set to avro-schema-store and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
avro_schema The new Avro schema.
schema_location Location where the schema file was written.

Kerberos Authentication

When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to HiveServer2. By default, Data Collector uses the user account who started it to connect.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file and include the Kerberos principal in the HiveServer2 JDBC URL.

For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.

Hive Properties and Configuration Files

You must configure Hive Metastore to use Hive and Hadoop configuration files and individual properties.

Configuration Files
The following configuration files are required for the Hive Metastore destination:
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
Individual properties
You can configure individual Hive properties in the destination. To add a Hive property, specify the exact property name and the value. The processor does not validate the property names or values.
Note: Individual properties override properties defined in the configuration files.

Hive Data Types

The following table lists Data Collector data types and the corresponding Hive data types. The Hive Metadata processor uses these conversions when generating metadata records. The Hive Metadata destination uses these conversions when generating Hive CREATE TABLE and ALTER TABLE statements.

Data Collector Data Type Hive Data Type
Boolean Boolean
Byte Not supported
Char String
Date Date
Datetime String
Decimal Decimal
Double Double
Float Float
Integer Int
Long Bigint
List Not supported
List-Map Not supported
Map Not supported
Short Int
String String
Time String

Configuring a Hive Metastore Destination

Configure a Hive Metastore destination to process metadata records from the Hive Metadata processor and update Hive tables and Avro schemas as needed.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Since the destination processes only metadata records, this property is not relevant.
    Preconditions Since the destination processes only metadata records, this property is not relevant.
    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Hive tab, configure the following properties:
    Hive Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    JDBC URL

    JDBC URL for Hive. For details about specifying the URL, see our StreamSets Community post.

    You can optionally include the user name and password in the JDBC URL. If you include a password with special characters, you must URL-encode (also called percent-encoding) the special characters. Otherwise errors will occur when validating or running your pipeline. For example, if your JDBC URL looks like this:
    jdbc:hive2://sunnyvale:12345/default;user=admin;password=a#b!c$e
    URL-encode your password so that your JDBC URL looks like this:
    jdbc:hive2://sunnyvale:12345/default;user=admin;password=a%23b%21c%24e
    Tip: To secure sensitive information, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    JDBC Driver Name The fully-qualified JDBC driver name.
    Use Credentials Enables entering credentials in properties. Use when you do not include credentials in the JDBC URL.
    Note: To impersonate the current user in connections to Hive, you can edit the Data Collector configuration file, $SDC_CONF/sdc.properties to configure Data Collector to automatically impersonate the user without specifying credentials in the pipeline. See Configuring Data Collector in the Data Collector documentation.
    Username User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database.

    Password Password for the JDBC user name.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Hadoop Configuration Directory

    Absolute path to the directory containing the following Hive and Hadoop configuration files:

    • core-site.xml
    • hdfs-site.xml
    • hive-site.xml

    For a Cloudera Manager installation, enter hive-conf.

    Note: Properties in the configuration files are overridden by individual properties defined in the Additional Hadoop Configuration property.

    Additional Hadoop Configuration

    Additional properties to use.

    Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by Hive and HDFS or MapR FS.

    Additional JDBC Configuration Properties Additional JDBC configuration properties to pass to the JDBC driver.

    Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by the JDBC driver.

  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Stored as Avro Uses the Stored As Avro clause in the SQL commands that generate Hive tables when the Drift Synchronization Solution for Hive processes Avro data. When selected, the Avro schema URL is not included in the query.
    Schema Folder Location Location to store the Avro schemas when Stored as Avro is not selected.

    Use a leading slash ( / ) to specify a fully-qualified path. Omit the slash to specify a path relative to the table directory.

    When not specified, the schemas are stored in the .schema sub-folder of the table directory.

    HDFS User

    Optional HDFS user for the destination to use when generating schemas.

    Header Attribute Expressions When generating events, adds the specified header attributes to the event record.

    Using simple or bulk edit mode, click the Add icon to include custom record header attributes. Then, enter the name and value for the attribute.

    You can use expressions to define the name and value to use.
    Max Cache Size (entries) Maximum number of entries in the cache.

    When the cache reaches the maximum size, the oldest cached entries are evicted to allow for new data.

    Default is -1, an unlimited cache size.