Hive Metastore

Supported pipeline types:
  • Data Collector

The Hive Metastore destination works with the Hive Metadata processor and the Hadoop FS or MapR FS destination as part of the Drift Synchronization Solution for Hive. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

The Hive Metastore destination uses metadata records generated by the Hive Metadata processor to create and update Hive tables. This enables the Hadoop FS and MapR FS destinations to write drifting Avro or Parquet data to HDFS or MapR FS.

The Hive Metastore destination compares information in metadata records with Hive tables, and then creates or updates the tables as needed. For example, when the Hive Metadata processor encounters a record that requires a new Hive table, it passes a metadata record to the Hive Metastore destination and the destination creates the table.
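To make the create-or-update behavior concrete, the following is a minimal sketch of the kind of reconciliation a metadata record triggers. The record layout, the reconcile helper, and the generated DDL are simplified illustrations for this document, not the destination's actual internals:

    # Hypothetical sketch: reconcile a simplified metadata record with known tables.
    # A "metadata record" is reduced here to a table name and a column/type dict.
    def reconcile(metadata_record, existing_tables):
        """Create the Hive table if missing; add any new (drifted) columns."""
        table = metadata_record["table"]      # e.g. "sales"
        wanted = metadata_record["columns"]   # e.g. {"id": "INT", "amount": "DOUBLE"}

        if table not in existing_tables:
            cols = ", ".join(f"{name} {dtype}" for name, dtype in wanted.items())
            return f"CREATE TABLE {table} ({cols})"

        current = existing_tables[table]
        new_cols = {n: t for n, t in wanted.items() if n not in current}
        if new_cols:
            cols = ", ".join(f"{name} {dtype}" for name, dtype in new_cols.items())
            return f"ALTER TABLE {table} ADD COLUMNS ({cols})"
        return None  # table already matches the record; nothing to do

    # Example: the table exists, but the record carries a new "amount" field.
    tables = {"sales": {"id": "INT"}}
    print(reconcile({"table": "sales",
                     "columns": {"id": "INT", "amount": "DOUBLE"}}, tables))
    # ALTER TABLE sales ADD COLUMNS (amount DOUBLE)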

Hive table names, column names, and partition names are created with lowercase letters. Names that include uppercase letters are converted to lowercase in Hive. For example, a field named SalesAmount results in a Hive column named salesamount.

Note that the Hive Metastore destination does not process data. It processes only metadata records generated by the Hive Metadata processor and must be downstream from the processor's metadata output stream.

When you configure the Hive Metastore destination, you define the connection information for Hive and the location of the Hive and Hadoop configuration files, and you can optionally specify additional Hive configuration properties. You can also enable Kerberos authentication, set a maximum cache size for the destination, determine how new tables are created and stored, and configure custom record header attributes.
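As a rough illustration of the connection information involved, the sketch below assembles a HiveServer2 JDBC URL of the common form jdbc:hive2://<host>:<port>/<db>, appending a service principal when Kerberos is enabled. The host, port, database, and principal values are placeholders, not product defaults:

    from typing import Optional

    def hive_jdbc_url(host: str, port: int, db: str,
                      kerberos_principal: Optional[str] = None) -> str:
        """Build a HiveServer2 JDBC URL; append the service principal for Kerberos."""
        url = f"jdbc:hive2://{host}:{port}/{db}"
        if kerberos_principal:
            # With Kerberos authentication, the Hive service principal
            # is carried in the URL itself.
            url += f";principal={kerberos_principal}"
        return url

    # Example (placeholder values):
    print(hive_jdbc_url("hive-host.example.com", 10000, "default",
                        "hive/_HOST@EXAMPLE.COM"))
    # jdbc:hive2://hive-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM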

The destination can also generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.

Important: When using the destination in multiple pipelines, take care to avoid concurrent or conflicting writes to the same tables.

For more information about the Drift Synchronization Solution for Hive and case studies for processing Avro and Parquet data, see Drift Synchronization Solution for Hive. For a hands-on walkthrough, see the tutorial on GitHub.