Drift Synchronization Solution for Hive

The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables.

Previously known as the Hive Drift Solution, the Drift Synchronization Solution for Hive creates and updates Hive tables based on record requirements, and writes data to HDFS or MapR FS based on record header attributes. You can use the full functionality of the solution or individual pieces, as needed.
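
As a rough illustration of what "updating Hive tables based on record requirements" can mean, the following Python sketch compares the fields of an incoming record with the columns of an existing table and builds a Hive ALTER TABLE statement for any missing columns. This is not how the solution is implemented. The function names, the type mapping, and the sample table name are all hypothetical and chosen only for the example.

```python
from typing import Dict

# Hypothetical mapping from Python value types to Hive column types.
# Used only for this illustration; unknown types fall back to STRING.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def detect_drift(record: Dict[str, object], table_columns: Dict[str, str]) -> Dict[str, str]:
    """Return the columns that the record needs but the table lacks.

    table_columns maps lowercase column names to Hive types.
    """
    missing = {}
    for name, value in record.items():
        column = name.lower()  # compare names case-insensitively
        if column not in table_columns:
            missing[column] = HIVE_TYPES.get(type(value), "STRING")
    return missing


def alter_table_statement(table: str, new_columns: Dict[str, str]) -> str:
    """Build a Hive ALTER TABLE ... ADD COLUMNS statement for the new columns."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in new_columns.items())
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


if __name__ == "__main__":
    existing = {"id": "BIGINT", "name": "STRING"}
    incoming = {"id": 1, "name": "a", "score": 2.5}
    drift = detect_drift(incoming, existing)
    if drift:
        print(alter_table_statement("db.t", drift))
```

In this sketch, a record that matches the table produces an empty result and no statement is generated. Only a record with a new field triggers a table update.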

The Drift Synchronization Solution for Hive supports processing Avro and Parquet data. When processing Parquet data, the solution generates temporary Avro files and uses the MapReduce executor to convert the Avro files to Parquet.

The solution is compatible with Impala, but requires additional steps to refresh the Impala metadata cache.
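
To give a sense of what refreshing the Impala metadata cache involves, the sketch below picks between two Impala statements, INVALIDATE METADATA and REFRESH. It assumes that a new table or a changed table structure calls for INVALIDATE METADATA, while new data files in an existing table only need REFRESH. The function name and its parameters are hypothetical, and the sketch only builds the statement text. How the statement is run against Impala is left out.

```python
def impala_refresh_statement(table: str, metadata_changed: bool) -> str:
    """Return the Impala statement that makes recent changes visible.

    Assumption for this sketch: metadata_changed is True when the table
    was created or altered outside Impala, and False when only new data
    files were written to an existing table.
    """
    if metadata_changed:
        # Discards the cached metadata so the table definition is reloaded.
        return f"INVALIDATE METADATA {table}"
    # Reloads the file listing for a table that Impala already knows about.
    return f"REFRESH {table}"


if __name__ == "__main__":
    print(impala_refresh_statement("db.t", metadata_changed=True))
    print(impala_refresh_statement("db.t", metadata_changed=False))
```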

Tip: You can also download the sample Drift Synchronization for Hive pipeline from the StreamSets Data Collector pipeline library, import it into Data Collector, and then follow these instructions to explore the solution in more detail.