Hive Query

The Hive Query executor connects to Hive or Impala and performs one or more user-defined Hive or Impala queries each time it receives an event record. For information about supported versions, see Supported Systems and Versions in the Data Collector documentation.

Use the Hive Query executor as part of an event stream to perform event-driven queries in Hive or Impala. You can use the executor in any logical way, such as running Hive or Impala queries after the Hive Metadata destination updates the Hive metastore, or after the Hadoop FS or MapR FS destination closes files.

For example, you can use the Hive Query executor to perform the Invalidate Metadata query for Impala as part of the Drift Synchronization Solution for Hive or to configure table properties for newly-created tables.

When using the Hive Query executor with Impala, you can use the default driver included with Data Collector, or you can install an Impala JDBC driver.

Note: The Hive Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.

When you configure the Hive Query executor, you configure JDBC connection information to Hive, and optionally add additional HDFS configuration properties to use. You can also use a connection to configure the executor. You specify the queries that you want to run and indicate whether to run the remaining queries after a query fails.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

For a solution that describes how to use the Hive Query executor, see Automating Impala Metadata Updates for Drift Synchronization for Hive.

Related Event Generating Stages

Use the Hive Query executor in an event stream. The Hive Query executor is designed to run a set of Hive or Impala queries after receiving an event record. You can use the Hive Query executor with any event-generating stage where the logic suits your needs.

When implementing the Drift Synchronization Solution for Hive with Impala, use the executor to run the Invalidate Metadata query after the Hive Metastore destination changes table structures and after the Hadoop FS destination writes files to Hive.

Installing the Impala Driver

You can use the Apache Hive JDBC driver included with Data Collector to perform Impala queries. However, some distributions recommend using a native Impala JDBC driver.

To use the included Apache Hive JDBC driver included with Data Collector, you do not need to perform any additional steps.

To use an Impala JDBC driver, perform the following steps:
  1. Download the native Impala JDBC driver for the Hive distribution that you use.
  2. Install the driver as an external library for the stage library used by the Hive Query executor.

    For example, say the executor is configured to use the Cloudera CDP 7.1.8 stage library. You install the driver into the Cloudera CDP 7.1.8 stage library, streamsets-datacollector-cdp_7_1_8-lib, which includes the executor.

For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

If you have trouble determining the URL format to use when configuring the executor, check out this informative blog post.

Hive and Impala Queries

You can use the Hive Query executor to execute a set of Hive or Impala queries each time the executor receives an event record.

The Hive Query executor waits for each query to complete before continuing with the next query for the same event record. It also waits for all queries to complete before starting the queries for the next event record. Depending on the speed of the pipeline and the complexity of the queries, the wait for query completion can slow pipeline performance.

When possible, avoid using Hive Query executor to run long-running queries. Also, when running multiple queries for an event record, you can configure the executor to skip the remaining queries if a query fails. By default, the executor continues to run the rest of the queries.

You can use the fields and attributes in an event record in queries. For example, for event records generated when Hive Metastore creates or updates a table, you can use the table name in the event record to perform additional tasks.

For a list of field names and descriptions in an event record, see the "Event Records" documentation for the event generating stage.

Impala Queries for the Drift Synchronization Solution for Hive

The Drift Synchronization Solution for Hive enables a pipeline to automatically create and update Hive tables and to write files to the tables.

When implementing the Drift Synchronization Solution for Hive with Impala, you can use the Hive Query executor to submit an invalidate metadata query each time you need to update the Impala metadata cache. For a detailed example, see Automating Impala Metadata Updates for Drift Synchronization for Hive.

Connect a Hive Query executor to the event stream from the Hive Metastore destination and the Hadoop FS destination. You can use the same executor for both, or use a separate executor for each.
Processing event records from the Hive Metastore destination
The Hive Metastore destination generates an event record each time it changes a table and places the table name in the "table" record header attribute. Use the following query to update the Impala metadata cache:
invalidate metadata ${record:attribute('/table')}
When the Hive Query executor receives the event record, it runs the Invalidate Metadata query on the table specified in the event record.
Processing event records from the Hadoop FS destination
The Hadoop FS destination generates an event record each time it closes a file. It places the file path in a "filepath" field in the event record. If you use a separate Hive Query executor for each destination, use the following query to update the Impala cache:
invalidate metadata `${file:pathElement(record:value('/filepath'), -3)}`.
`${file:pathElement(record:value('/filepath'), -2)}`      
This expression uses the third-to-last section of the path as the database name, and the second-to-last section of the path as the table name.
If you want to use the same Hive Query executor to process records from Hive Query and Hadoop FS, add an Expression Evaluator in the Hadoop FS event stream to perform this processing. For an example, see Automating Impala Metadata Updates for Drift Synchronization for Hive.

Event Generation

The Hive Query executor can generate events that you can use in an event stream.

The executor submits queries to Hive each time it receives an event record. When you enable event generation, the executor generates events each time it determines if the submitted query has completed.

Hive Query events can be used in any logical way. For example:

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the Hive Query executor have the following event-related record header attributes. Record header attributes are stored as String values:
Record Header Attribute Description
sdc.event.type Event type. Uses one of the following event types:
  • successful-query - Generated when the executor determines that Hive successfully ran the submitted query.
  • failed-query - Generated when the executor determines that Hive failed to run the query.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
The Hive Query executor can generate the following types of event records:
Successful query

The executor generates a successful query event record when Hive successfully completes the query.

A successful query event record has the sdc.event.type record header attribute set to successful-query and includes the following field:
Event Field Name Description
query The query that Hive ran successfully.
Failed query

The executor generates a failed query event record when Hive fails to run the query.

A failed query event record has the sdc.event.type record header attribute set to failed-query and can include the following fields:
Event Field Name Description
query The query that Hive failed to run.
unexecuted-queries Any additional queries that did not execute.

The event record includes this field only when you configure the executor to skip running subsequent queries when a query fails.

Configuring a Hive Query Executor

Configure a Hive Query executor to execute a query on Hive or Impala when the executor receives an event record.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Hive tab, configure the following properties:
    Hive Property Description
    Connection Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    JDBC URL

    JDBC URL for Hive. For details about specifying the URL, see this informative community post.

    You can optionally include the user name and password in the JDBC URL. If you include a password with special characters, you must URL-encode (also called percent-encoding) the special characters. Otherwise errors will occur when validating or running your pipeline. For example, if your JDBC URL looks like this:
    jdbc:hive2://sunnyvale:12345/default;user=admin;password=a#b!c$e
    URL-encode your password so that your JDBC URL looks like this:
    jdbc:hive2://sunnyvale:12345/default;user=admin;password=a%23b%21c%24e
    Tip: To secure sensitive information, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    JDBC Driver Name The fully-qualified JDBC driver name.

    Before using an Impala JDBC driver, install the driver as an external library for the stage library used by the Hive Query executor. For more information, see Installing the Impala Driver.

    Use Credentials Enables entering credentials in properties. Use when you do not include credentials in the JDBC URL.
    Note: To impersonate the current user in connections to Hive, you can edit the Data Collector configuration file, $SDC_CONF/sdc.properties to configure Data Collector to automatically impersonate the user without specifying credentials in the pipeline. See Configuring Data Collector in the Data Collector documentation.
    Username User name for the JDBC connection.

    The user account must have the correct permissions or privileges in the database.

    Password Password for the JDBC user name.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores. For more information about credential stores, see Credential Stores in the Data Collector documentation.
    Hadoop Configuration Directory

    Absolute path to the directory containing the following Hive and Hadoop configuration files:

    • core-site.xml
    • hdfs-site.xml
    • hive-site.xml

    For a Cloudera Manager installation, enter hive-conf.

    Note: Properties in the configuration files are overridden by individual properties defined in the Additional Hadoop Configuration property.

    Additional Hadoop Configuration

    Additional properties to use.

    Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by Hive and HDFS or MapR FS.

    Additional JDBC Configuration Properties Additional JDBC configuration properties to pass to the JDBC driver.

    Using simple or bulk edit mode, click Add to add additional properties and define the property name and value. Use the property names and values as expected by the JDBC driver.

  3. On the Query tab, configure the following property:
    Query Property Description
    SQL Queries One or more SQL queries to execute on Hive or Impala upon receiving an event record.

    Using simple or bulk edit mode, click the Add icon to add additional queries.

    The executor processes multiple queries in order, and waits for each query to complete before continuing to the next query.

    To process events from the Hive Metastore or Hadoop FS destinations when using the Drift Synchronization Solution for Hive with Impala, you might use the following query:
    invalidate metadata ${record:attribute('/table')}

    For more information, see Impala Queries for the Drift Synchronization Solution for Hive.

    Stop on Query Failure Skips the remaining queries for an event record when a query fails.

    By default, upon a failed query, the executor continues to run all configured queries for the event record.

    When selected, the executor skips the remaining queries and starts running queries for the next event record.