Hive Metastore
Supported pipeline types:
|
The Hive Metastore destination compares information in metadata records with Hive tables, and then creates or updates the tables as needed. For example, when the Hive Metadata processor encounters a record that requires a new Hive table, it passes a metadata record to the Hive Metastore destination and the destination creates the table.
Hive table names, column names, and partition names are created with lowercase letters. Names that include uppercase letters become lowercase in Hive.
Note that the Hive Metastore destination does not process data. It processes only metadata records generated by the Hive Metadata processor and must be downstream from the processor's metadata output stream.
When you configure Hive Metastore, you define the connection information for Hive, the location of the Hive and Hadoop configuration files and optionally specify additional required properties.
You can enable Kerberos authentication. You can also set a maximum cache size for the destination, determine how new tables are created and stored, and configure custom record header attributes.
The destination can also generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.
For more information about the Drift Synchronization Solution for Hive and case studies for processing Avro and Parquet data, see Drift Synchronization Solution for Hive. For a tutorial, check out our tutorial on GitHub.
Metadata Processing
When processing records, Hive Metastore destination performs the following tasks:
- Creates or updates Hive tables as needed
-
- For each metadata record that includes a request to create or update a table, the destination checks Hive for the table.
- If the table in the metadata record does not exist or differs from
the Hive table, the destination creates or updates the table as
needed.
Hive table names, column names, and partition names are created with lowercase letters. For more information, see Hive Names and Supported Characters.
- Creates new Avro schemas as needed
- When the Drift Synchronization Solution for Hive processes Avro data, you can configure the Hive Metastore destination to generate Avro schemas. Under these conditions, the Hive Metastore performs the following tasks:
Hive Table Generation
When the Drift Synchronization Solution for Hive processes Parquet data, the destination uses the Stored as Parquet clause when generating the table so it does not need to generate a new schema for each change.
- With the Stored As Avro clause
- Generates the table with a query that includes the Stored As Avro clause. When using the Stored As Avro clause, the destination does not need to generate an Avro schema for each change in the Hive table.
- Without the Stored As Avro clause
- Generates the table without a Stored As Avro clause in the query. Instead,
the destination generates an Avro schema for each Hive table update. The
destination uses the following format for the schema name:
avro_schema_<database>_<table>_<UUID>.avsc
.
Cache
The Hive Metastore destination queries Hive for information and caches the results. When possible, it uses the cache to avoid unnecessary Hive queries.
- Database and table to be written to
- Hive table properties
- Column names and types in the table
- Partition values
Cache Size and Evictions
You can configure the maximum size of the cache. When the cache reaches the specified limit, it uses the LRU eviction policy, which removes the least recently used data to allow for new entries to be written to the cache.
Event Generation
- With the Hive Query executor to run Hive or Impala queries after updating the
Hive metastore.
For an example, see Automating Impala Metadata Updates for Drift Synchronization for Hive.
- With the HDFS File Metadata executor to move
or change permissions on closed files.
For an example, see Managing Output Files.
- With the Email executor to send a custom email
after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Hive Metastore event records include the following event-related record header attributes. Record header attributes are stored as String values:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following types:
|
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
- New table
-
The destination generates a new table event record when it creates a new table.
New table event records have thesdc.event.type
record header attribute set tonew-table
and include the following field:Field Description table Fully qualified table name using the following format: '<DB>'.'<table>'
.columns A list-map field with the following information for each new column: - Column name
- Hive data type, precision and scale
For example:-
- key: `id`
- value: INT
-
- key: `desc`
- value: STRING
partitions A list-map field with the following information: - Partition name
- Partition value
For example:-
- name: `dt`
- value: 2017-01-01
- New column
-
The destination generates a new column event record when it creates new columns in a table.
New column event records have thesdc.event.type
record header attribute set tonew-column
and include the following fields:Field Description table Fully qualified table name using the following format: <db>.<table>
.columns A list-map field with the following information for each new column: - Column name
- Hive data type, precision and scale
- New partition
-
The destination generates a new partition event record when it creates a new partition in a table.
New partition event records have thesdc.event.type
record header attribute set tonew-partition
and include the following fields:Field Description table Fully qualified table name using the following format: <db>.<table>
.partitions A list-map field with the following information: - Partition name
- Partition value
- New Avro schema file
- When the Drift Synchronization Solution for Hive processes Avro data and the Stored as Avro option in the destination is not enabled, the destination generates and writes an Avro schema file event each time it creates or updates a table.
Kerberos Authentication
When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to HiveServer2. By default, Data Collector uses the user account who started it to connect.
The Kerberos principal and keytab are defined in the Data Collector configuration file,
$SDC_CONF/sdc.properties
. To use Kerberos authentication, configure all Kerberos properties in the Data Collector
configuration file and include the Kerberos principal in the HiveServer2 JDBC URL.
For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication.
Hive Properties and Configuration Files
You must configure Hive Metastore to use Hive and Hadoop configuration files and individual properties.
- Configuration Files
-
The following configuration files are required for the Hive Metastore destination:
- core-site.xml
- hdfs-site.xml
- hive-site.xml
- Individual properties
- You can configure individual Hive properties in the destination. To add a
Hive property, specify the exact property name and the value. The processor
does not validate the property names or values.Note: Individual properties override properties defined in the configuration files.
Hive Data Types
The following table lists Data Collector data types and the corresponding Hive data types. The Hive Metadata processor uses these conversions when generating metadata records. The Hive Metadata destination uses these conversions when generating Hive CREATE TABLE and ALTER TABLE statements.
Data Collector Data Type | Hive Data Type |
---|---|
Boolean | Boolean |
Byte | Not supported |
Char | String |
Date | Date |
Datetime | String |
Decimal | Decimal |
Double | Double |
Float | Float |
Integer | Int |
Long | Bigint |
List | Not supported |
List-Map | Not supported |
Map | Not supported |
Short | Int |
String | String |
Time | String |
Configuring a Hive Metastore Destination
Configure a Hive Metastore destination to process metadata records from the Hive Metadata processor and update Hive tables and Avro schemas as needed.
-
In the Properties panel, on the General tab, configure the
following properties:
General Property Description Name Stage name. Description Optional description. Stage Library Library version that you want to use. Produce Events Generates event records when events occur. Use for event handling. Required Fields Since the destination processes only metadata records, this property is not relevant. Preconditions Since the destination processes only metadata records, this property is not relevant. On Record Error Error record handling for the stage: - Discard - Discards the record.
- Send to Error - Sends the record to the pipeline for error handling.
- Stop Pipeline - Stops the pipeline.
-
On the Hive tab, configure the following properties:
Hive Property Description JDBC URL JDBC URL for Hive. For details about specifying the URL, see our StreamSets Community post.
You can optionally include the user name and password in the JDBC URL. If you include a password with special characters, you must URL-encode (also called percent-encoding) the special characters. Otherwise errors will occur when validating or running your pipeline. For example, if your JDBC URL looks like this:jdbc:hive2://sunnyvale:12345/default;user=admin;password=a#b!c$e
URL-encode your password so that your JDBC URL looks like this:jdbc:hive2://sunnyvale:12345/default;user=admin;password=a%23b%21c%24e
JDBC Driver Name The fully-qualified JDBC driver name. Use Credentials Enables entering credentials in properties. Use when you do not include credentials in the JDBC URL. Note: To impersonate the current user in connections to Hive, you can edit the Data Collector configuration file,$SDC_CONF/sdc.properties
to configure Data Collector to automatically impersonate the user without specifying credentials in the pipeline. See Configuring Data Collector.Username User name for the JDBC connection. The user account must have the correct permissions or privileges in the database.
Password Password for the JDBC user name. Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores.Hadoop Configuration Directory Absolute path to the directory containing the following Hive and Hadoop configuration files:
- core-site.xml
- hdfs-site.xml
- hive-site.xml
For a Cloudera Manager installation, enter
hive-conf.
Note: Properties in the configuration files are overridden by individual properties defined in the Additional Hadoop Configuration property.Additional Hadoop Configuration
Additional properties to use.
Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by Hive and HDFS or MapR FS.
Additional JDBC Configuration Properties Additional JDBC configuration properties to pass to the JDBC driver. Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by the JDBC driver.
-
On the Advanced tab, optionally configure the following
properties:
Advanced Property Description Stored as Avro Uses the Stored As Avro clause in the SQL commands that generate Hive tables when the Drift Synchronization Solution for Hive processes Avro data. When selected, the Avro schema URL is not included in the query. Schema Folder Location Location to store the Avro schemas when Stored as Avro is not selected. Use a leading slash ( / ) to specify a fully-qualified path. Omit the slash to specify a path relative to the table directory.
When not specified, the schemas are stored in the .schema sub-folder of the table directory.
HDFS User Optional HDFS user for the destination to use when generating schemas.
Header Attribute Expressions When generating events, adds the specified header attributes to the event record. Using simple or bulk edit mode, click the Add icon to include custom record header attributes. Then, enter the name and value for the attribute.
You can use expressions to define the name and value to use.Max Cache Size (entries) Maximum number of entries in the cache. When the cache reaches the maximum size, the oldest cached entries are evicted to allow for new data.
Default is -1, an unlimited cache size.