Hive Streaming (deprecated)

Supported pipeline types:
  • Data Collector

The Hive Streaming destination writes data to Hive tables stored in the ORC (Optimized Row Columnar) file format. For information about supported versions, see Supported Systems and Versions.
Important: This stage is deprecated and may be removed in a future release.

Before you use the destination, verify that your Hadoop implementation supports Hive Streaming.

When configuring Hive Streaming, you specify the Hive metastore and a bucketed table stored in the ORC file format. You define the location of the Hive and Hadoop configuration files and can optionally specify additional Hive properties. By default, the destination creates new partitions as needed.

Hive Streaming writes data to the table based on matching field names. You can define custom field mappings that override the default field mappings.
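Hive Streaming writes only to bucketed, transactional tables stored as ORC. As a sketch, a compatible table could be created with HiveQL like the following; the table name, columns, partition column, and bucket count are all illustrative:

```sql
-- Illustrative table for Hive Streaming: bucketed, stored as ORC,
-- and marked transactional (required by the Hive Streaming API).
CREATE TABLE web_events (
  event_id STRING,
  user_id  STRING,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```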

Before you use the Hive Streaming destination with the MapR library in a pipeline, you must perform additional steps to enable Data Collector to process MapR data. For more information, see MapR Prerequisites.

Hive Properties and Configuration Files

You can configure Hive Streaming to use Hive and Hadoop configuration files and additional properties:
Configuration files
The following configuration files are required for the Hive Streaming destination:
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
To use the configuration files:
  1. Store the files, or a symlink to the files, in the Data Collector resources directory or in another directory local to the Data Collector.
  2. If the files are stored in the resources directory, specify a relative path to the files in the stage. If the files are stored outside of the resources directory, specify an absolute path to the files.
    Note: For a Cloudera Manager installation, Data Collector automatically creates a symlink to the files named hive-conf. Enter hive-conf for the location of the files in the stage.
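For reference, the metastore connection that the destination uses is typically defined in hive-site.xml. A minimal excerpt might look like the following; the host and port are illustrative, and hive.metastore.uris is a standard Hive property:

```xml
<!-- hive-site.xml excerpt; host and port are illustrative -->
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>
```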
Individual properties
You can configure individual Hive properties in the destination. To add a Hive property, specify the exact property name and the value. The destination does not validate the property names or values.
Note: Individual properties override properties defined in the configuration files.
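For example, to enable dynamic partitioning regardless of what the configuration files define, you might add a property such as the following. The property shown is a standard Hive property used here for illustration; whether you need it depends on your cluster:

```
Name:  hive.exec.dynamic.partition.mode
Value: nonstrict
```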

Configuring a Hive Streaming Destination

Use the Hive Streaming destination to write data to Hive.
Important: This stage is deprecated and may be removed in a future release.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Hive tab, configure the following properties:
    Hive Property Description
    Hive Metastore Thrift URL Thrift URI for the Hive metastore. Use the following format:
    thrift://<host>:<port>

    The port number is typically 9083.

    Schema Hive schema.
    Table Bucketed Hive table stored in the ORC file format.
    Hive Configuration Directory Absolute path to the directory containing the Hive and Hadoop configuration files. For a Cloudera Manager installation, enter hive-conf.

    The destination uses the following configuration files:
    • core-site.xml
    • hdfs-site.xml
    • hive-site.xml
    Note: Properties in the configuration files are overridden by individual properties defined in this destination.
    Field to Column Mapping Use to override the default field to column mappings.

    By default, fields are written to columns of the same name.

    Create Partitions Automatically creates partitions when needed. Used for partitioned tables only.
  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Transaction Batch Size The number of transactions to request in a batch for each partition in the table. For more information, see the Hive documentation.

    Default is 1000 transactions.

    Buffer Limit (KB) Maximum size of the record to be written to the destination. Increase the size to accommodate larger records.

    Records that exceed the limit are handled based on the error handling configured for the stage.

    Hive Configuration Additional Hive properties to use. Using simple or bulk edit mode, click the Add icon and define the property name and value.

    Use the property names and values as expected by Hive.
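After configuring the destination, you can check from Beeline or the Hive CLI that the target table meets the streaming requirements described above. The table name below is illustrative:

```sql
-- Confirm the table is transactional, then inspect its storage
-- format and bucketing details in the formatted description:
SHOW TBLPROPERTIES web_events ('transactional');
DESCRIBE FORMATTED web_events;
```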