Hive Query
The Hive Query executor connects to Hive or Impala and performs one or more user-defined Hive or Impala queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions.
Use the Hive Query executor as part of an event stream to perform event-driven queries in Hive or Impala. You can use the executor in any logical way, such as running Hive or Impala queries after the Hive Metadata destination updates the Hive metastore, or after the Hadoop FS or MapR FS destination closes files.
For example, you can use the Hive Query executor to perform the Invalidate Metadata query for Impala as part of the Drift Synchronization Solution for Hive or to configure table properties for newly-created tables.
When using the Hive Query executor with Impala, you can use the default driver included with Data Collector, or you can install an Impala JDBC driver.
When you configure the Hive Query executor, you configure JDBC connection information to Hive, and optionally add additional HDFS configuration properties to use. You can also use a connection to configure the executor. You specify the queries that you want to run and indicate whether to run the remaining queries after a query fails.
You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
For a solution that describes how to use the Hive Query executor, see Automating Impala Metadata Updates for Drift Synchronization for Hive.
Related Event Generating Stages
Use the Hive Query executor in an event stream. The Hive Query executor is designed to run a set of Hive or Impala queries after receiving an event record. You can use the Hive Query executor with any event-generating stage where the logic suits your needs.
When implementing the Drift Synchronization Solution for Hive with Impala, use the executor to run the Invalidate Metadata query after the Hive Metastore destination changes table structures and after the Hadoop FS destination writes files to Hive.
Installing the Impala Driver
You can use the Apache Hive JDBC driver included with Data Collector to perform Impala queries. However, some distributions recommend using a native Impala JDBC driver.
To use the Apache Hive JDBC driver included with Data Collector, you do not need to perform any additional steps. To use a native Impala JDBC driver, perform the following steps:
- Download the native Impala JDBC driver for the Hive distribution that you use.
- Install the driver as an external library for the stage library used by the Hive Query executor.
For example, if the executor is configured to use the Cloudera CDP 7.1.8 stage library, you install the driver into that stage library, streamsets-datacollector-cdp_7_1_8-lib.
For information about installing additional drivers, see Install External Libraries.
If you have trouble determining the URL format to use when configuring the executor, check out our Ask StreamSets post.
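If you install a native Impala JDBC driver, the JDBC URL typically uses the driver's own scheme rather than the Hive scheme. The following is a minimal sketch assuming the Cloudera Impala JDBC driver and the default Impala daemon port of 21050; the host name and database are placeholders, so substitute values for your cluster:

```
jdbc:impala://impala-host.example.com:21050/default
```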
Hive and Impala Queries
You can use the Hive Query executor to execute a set of Hive or Impala queries each time the executor receives an event record.
The Hive Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.
When possible, avoid using the Hive Query executor to run long-running queries. Also, when running multiple queries for an event record, you can configure the executor to skip the remaining queries if a query fails. By default, the executor continues to run the remaining queries after a failure.
You can use the fields and attributes in an event record in queries. For example, for event records generated when Hive Metastore creates or updates a table, you can use the table name in the event record to perform additional tasks.
For a list of the field names and descriptions in an event record, see the "Event Records" documentation for the event-generating stage.
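For example, the following sketch uses the table name from the "table" record header attribute to set a table property after the Hive Metastore destination creates or updates a table. The property name and value are illustrative assumptions, not part of any solution:

```sql
-- Hypothetical follow-up query: tag each changed table with a property.
-- ${record:attribute('/table')} resolves to the table named in the event record.
ALTER TABLE ${record:attribute('/table')} SET TBLPROPERTIES ('managed_by' = 'sdc')
```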
Impala Queries for the Drift Synchronization Solution for Hive
The Drift Synchronization Solution for Hive enables a pipeline to automatically create and update Hive tables and to write files to the tables.
When implementing the Drift Synchronization Solution for Hive with Impala, you can use the Hive Query executor to submit an invalidate metadata query each time you need to update the Impala metadata cache. For a detailed example, see Automating Impala Metadata Updates for Drift Synchronization for Hive.
- Processing event records from the Hive Metastore destination
- The Hive Metastore destination generates an event record each time it changes a table and places the table name in the "table" record header attribute. Use the following query to update the Impala metadata cache:
invalidate metadata ${record:attribute('/table')}
- Processing event records from the Hadoop FS destination
- The Hadoop FS destination generates an event record each time it closes a file and places the file path in a "filepath" field in the event record. If you use a separate Hive Query executor for each destination, use the following query to update the Impala metadata cache:
invalidate metadata `${file:pathElement(record:value('/filepath'), -3)}`.`${file:pathElement(record:value('/filepath'), -2)}`
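To see how this expression resolves, assume the destination closed a file at the hypothetical path /user/hive/warehouse/sales/orders/sdc-01.avro. The file:pathElement function with index -3 returns the third-from-last path element and index -2 returns the second-from-last, so the query submitted to Impala would be:

```sql
-- sales comes from pathElement(..., -3), orders from pathElement(..., -2)
invalidate metadata `sales`.`orders`
```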
Event Generation
The Hive Query executor can generate events that you can use in an event stream.
The executor submits queries to Hive each time it receives an event record. When you enable event generation, the executor generates an event each time a submitted query completes, recording whether the query succeeded or failed.
Hive Query executor events can be used in any logical way. For example:
- With the Email executor to send a custom email after receiving an event.
For an example, see Sending Email During Pipeline Processing.
- With a destination to store event information.
For an example, see Preserving an Audit Trail of Events.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Records
Event records generated by the Hive Query executor have the following event-related record header attributes:

Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses one of the following event types: successful-query or failed-query. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The executor can generate the following types of event records:

- Successful query
- The executor generates a successful query event record when Hive successfully completes the query. A successful query event record has the sdc.event.type record header attribute set to successful-query and includes the following field:

Event Field Name | Description |
---|---|
query | The query that Hive ran successfully. |

- Failed query
- The executor generates a failed query event record when Hive fails to run the query. A failed query event record has the sdc.event.type record header attribute set to failed-query and can include the following fields:

Event Field Name | Description |
---|---|
query | The query that Hive failed to run. |
unexecuted-queries | Any additional queries that did not execute. The event record includes this field only when you configure the executor to skip running subsequent queries when a query fails. |
Configuring a Hive Query Executor
Configure a Hive Query executor to execute a query on Hive or Impala when the executor receives an event record.
- In the Properties panel, on the General tab, configure the following properties:

General Property | Description |
---|---|
Name | Stage name. |
Description | Optional description. |
Stage Library | Library version that you want to use. |
Produce Events | Generates event records when events occur. Use for event handling. |
Required Fields | Fields that must include data for the record to be passed into the stage. Tip: You might include fields that the stage uses. Records that do not include all required fields are processed based on the error handling configured for the pipeline. |
Preconditions | Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage. |
On Record Error | Error record handling for the stage: Discard - Discards the record. Send to Error - Sends the record to the pipeline for error handling. Stop Pipeline - Stops the pipeline. |
- On the Hive tab, configure the following properties:

Hive Property | Description |
---|---|
Connection | Connection that defines the information required to connect to an external system. To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline. To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon. |
JDBC URL | JDBC URL for Hive. For details about specifying the URL, see our StreamSets Community post. You can optionally include the user name and password in the JDBC URL. If you include a password with special characters, you must URL-encode (also called percent-encode) the special characters. Otherwise, errors occur when validating or running the pipeline. For example, if your JDBC URL looks like jdbc:hive2://sunnyvale:12345/default;user=admin;password=a#b!c$e, URL-encode the password so that the JDBC URL looks like jdbc:hive2://sunnyvale:12345/default;user=admin;password=a%23b%21c%24e. |
JDBC Driver Name | The fully-qualified JDBC driver name. Before using an Impala JDBC driver, install the driver as an external library for the stage library used by the Hive Query executor. For more information, see Installing the Impala Driver. |
Use Credentials | Enables entering credentials in properties. Use when you do not include credentials in the JDBC URL. Note: To impersonate the current user in connections to Hive, you can edit the Data Collector configuration properties to configure Data Collector to automatically impersonate the user without specifying credentials in the pipeline. See Configuring Data Collector in the Data Collector documentation. |
Username | User name for the JDBC connection. The user account must have the correct permissions or privileges in the database. |
Password | Password for the JDBC user name. Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. |
Hadoop Configuration Directory | Absolute path to the directory containing the Hive and Hadoop configuration files core-site.xml, hdfs-site.xml, and hive-site.xml. Note: Properties in the configuration files are overridden by individual properties defined in the Additional Hadoop Configuration property. |
Additional Hadoop Configuration | Additional properties to use. Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by Hive and HDFS or MapR FS. |
Additional JDBC Configuration Properties | Additional JDBC configuration properties to pass to the JDBC driver. Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by the JDBC driver. |
- On the Query tab, configure the following properties:

Query Property | Description |
---|---|
SQL Queries | One or more SQL queries to execute on Hive or Impala upon receiving an event record. Using simple or bulk edit mode, click the Add icon to add additional queries. The executor processes multiple queries in order, and waits for each query to complete before continuing to the next query. To process events from the Hive Metastore or Hadoop FS destinations when using the Drift Synchronization Solution for Hive with Impala, you might use the following query: invalidate metadata ${record:attribute('/table')}. For more information, see Impala Queries for the Drift Synchronization Solution for Hive. |
Stop on Query Failure | Skips the remaining queries for an event record when a query fails. By default, upon a failed query, the executor continues to run all configured queries for the event record. When selected, the executor skips the remaining queries and starts running queries for the next event record. |