Hive
The Hive origin reads data from a Hive table. Apache Hive is data warehouse software that manages large datasets stored on the Hadoop Distributed File System (HDFS). Hive stores data in tables backed by files on HDFS.
By default, the origin reads from Hive using connection information stored in Hive configuration files on the Transformer machine. Alternatively, the origin can use connection information stored in an external Hive Metastore that you specify.
When you configure the origin, you indicate if the origin should run in incremental mode or full query mode. You define the query to use, the offset column, and optionally, an initial offset to use. When needed, you can specify URIs for an external Hive Metastore where configuration information is stored.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
With a Hive origin, Spark determines partitioning based on the partitioning configured within the Hive source table. If no partitioning is configured, the origin reads all available data within a single partition.
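For example, a partitioned Hive source table might be defined as follows; the table and column names here are illustrative, not taken from this documentation:

```sql
-- Rows are physically grouped by region on HDFS.
-- Spark creates initial pipeline partitions based on this partitioning.
CREATE TABLE invoice (
  id     BIGINT,
  amount DOUBLE
)
PARTITIONED BY (region STRING)
STORED AS PARQUET;
```

A table created without the PARTITIONED BY clause has no partitioning configured, so the origin reads all available data within a single partition.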
Spark uses partitions created by the origin throughout the pipeline unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.
Reading a Delta Lake Managed Table
You can use the Hive origin to read a Delta Lake managed table in streaming execution mode, or in batch execution mode with offset tracking. For all other cases, such as reading an unmanaged table, or reading in batch execution mode without offset tracking, use the Delta Lake origin.
Incremental and Full Query Mode
The Hive origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.
When the origin runs in full query mode with no initial offset defined, the origin processes all available data each time the pipeline runs. If you configure an initial offset, the origin starts the read with the initial offset each time the pipeline runs.
When the origin runs in incremental mode, the first pipeline run is the same as full query mode: the origin starts the read with the initial offset, if defined. Otherwise, it reads all available data. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin starts the read with the last-saved offset.
SQL Query
The SQL query defines the data returned from Hive. You can use any valid SQL query but query guidelines depend on whether you configure the origin to run in incremental mode or full query mode.
Incremental Mode Guidelines
When you define the SQL query for incremental mode, the Hive origin requires a WHERE clause and an ORDER BY clause in the query.
- WHERE clause - In the WHERE clause, include the offset column and the offset value. Use the $offset variable to represent the offset value.
- ORDER BY clause - In the ORDER BY clause, include the offset column as the first column to avoid returning duplicate data.
Note: Using a column that is not a primary key or indexed column in the ORDER BY clause can slow performance.
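For example, a query for incremental mode might look like the following, assuming a table named invoice with an incremental id offset column (both names are illustrative):

```sql
-- $offset is replaced with the initial offset on the first pipeline run,
-- and with the last-saved offset on subsequent runs
SELECT * FROM invoice
WHERE id > $offset
ORDER BY id
```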
Full Mode Guidelines
When you define the SQL query for full mode, you can optionally include WHERE and ORDER BY clauses using the same guidelines as for incremental mode. However, using these clauses to read from large tables can cause performance issues.
For example, the following query reads all rows from the invoice table:
SELECT * FROM invoice
Additional Hive Configuration Properties
When needed, you can pass additional Hive configuration properties to Hive as additional Spark configuration properties. To do so, define Extra Spark Configuration properties on the Cluster tab of the pipeline properties panel.
Hive configuration properties defined in the pipeline override those defined in Hive configuration files.
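As a sketch of what such a property might look like: Spark generally forwards configuration properties prefixed with spark.hadoop. to the underlying Hadoop and Hive configuration, so an Extra Spark Configuration entry could take a form like the following. The specific Hive property and value shown are illustrative assumptions, not recommendations from this documentation:

```
spark.hadoop.hive.metastore.client.socket.timeout=300
```

Because pipeline-defined properties override those in the Hive configuration files, a value set this way takes effect even if hive-site.xml defines the same property.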
Configuring a Hive Origin
Configure a Hive origin to read from a Hive table.
-
On the Properties panel, on the General tab, configure the following properties:
General Property - Description
Name - Stage name.
Description - Optional description.
Load Data Only Once - Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run.
Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.
Cache Data - Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.
Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.
Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run.
Skip Offset Tracking - Skips tracking offsets. The origin reads all available data for each batch that the pipeline processes, limited by any batch-size configuration for the origin.
-
On the Hive tab, configure the following properties:
Hive Property - Description
Incremental Mode - Enables the origin to run in incremental mode. When not enabled, the origin runs in full query mode.
Hive Metastore URIs - Optional comma-separated list of URIs for an external Hive metastore that contains the Hive connection information to use. Use the following URI format: thrift://<metastore-host>:<metastore-port>
When not defined, the origin uses connection information defined in the hive-site.xml and core-site.xml Hive configuration files on the Transformer machine.
SQL Query - SQL query to use. The query requirements differ depending on whether the origin runs in incremental mode or full query mode.
Initial Offset - Optional initial offset value. When you use the $offset variable in the query, this value is substituted for the variable when you start the pipeline.
Required in incremental mode.
Offset Column - Column used to track the progress of the read. As a best practice, an offset column should be an incremental and unique column that does not contain null values. Having an index on this column is strongly encouraged, since the underlying query uses an ORDER BY clause and inequality operators on this column.
Required in incremental mode.