Hive
The Hive origin reads data from a Hive table. Apache Hive is a data warehouse system built on top of Hadoop that manages data in tables, with the underlying files stored on the Hadoop Distributed File System (HDFS).
By default, the origin reads from Hive using connection information stored in Hive configuration files on the Transformer machine. Alternatively, the origin can connect to an external Hive Metastore that you specify.
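For context, the sketch below shows roughly how a Spark application connects through an external Hive metastore, which is effectively what the origin does with the values that you enter. The thrift URI, database, and table names are placeholders, and the exact configuration that the origin applies may differ.

```python
# Illustrative sketch only; the metastore URI, database, and table are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-origin-sketch")
    # Point Spark at an external Hive metastore. Without this setting, Spark
    # falls back to the hive-site.xml found on the local machine.
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a Hive table through the metastore.
df = spark.sql("SELECT * FROM sales.orders")
df.show(5)
```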
When you configure the Hive origin, you indicate whether the origin runs in incremental mode or full query mode. You define the query to use and the offset column, and you can optionally specify an initial offset. When needed, you can specify the URIs of the external Hive Metastore to use.
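As a rough illustration of the difference between the two modes, the following sketch uses a hypothetical orders table with an order_id offset column. The query syntax the origin expects, including how it substitutes the saved offset, may differ from this hand-rolled version.

```python
# Conceptual sketch of full query mode versus incremental mode.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Full query mode: the same complete query runs each time the pipeline starts.
full_df = spark.sql("SELECT * FROM sales.orders")

# Incremental mode: only rows beyond the last saved offset are read, ordered by
# the offset column so the highest value can be saved as the next offset.
last_offset = 1000  # the initial offset you configure, or the last saved value
incremental_df = spark.sql(
    f"SELECT * FROM sales.orders WHERE order_id > {last_offset} ORDER BY order_id"
)
new_offset = incremental_df.agg({"order_id": "max"}).collect()[0][0]
```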
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
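In Spark terms, these options behave roughly like caching a DataFrame, as in the sketch below. The table name is a placeholder, and the origin manages this behavior for you based on the properties you set.

```python
# Conceptual sketch of the caching behavior; the table name is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.sql("SELECT * FROM sales.orders")

# "Load data only once" behaves like persisting the result for the whole run,
# so later batches reuse the cached data instead of re-reading the Hive table.
df.cache()
df.count()  # first action triggers the read and populates the cache

# With offset tracking skipped, no offset is saved between runs, so each
# pipeline start re-reads the entire data set.
```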