Parquet Data Format

Data Collector can read and write Parquet data.

Reading Parquet Data

When reading Parquet data, origins generate a record for each Parquet record in the file. The file must contain the Parquet schema, which the origin uses to generate records.
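
For illustration only, the following sketch (outside Data Collector, assuming the pyarrow library and a hypothetical file name) writes a small Parquet file and shows that the schema is embedded in the file itself, which is what lets an origin build records without any external metadata:

    # Illustrative sketch using pyarrow; not Data Collector code.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    pq.write_table(table, "example.parquet")

    # The schema is stored in the file footer and can be read directly.
    print(pq.read_schema("example.parquet"))

    # Reading yields one row per Parquet record, mirroring how an
    # origin generates one record per Parquet record in the file.
    for row in pq.read_table("example.parquet").to_pylist():
        print(row)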

Generated records include the Parquet schema in a parquetSchema record header attribute.

When Skip Union Indexes is not enabled, the origin generates an avro.union.typeIndex./id record header attribute identifying the index of the union element that the data is read from. If a schema contains many unions and the pipeline does not depend on index information, you can enable Skip Union Indexes to avoid the long processing times associated with storing a large number of indexes.
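
As a hedged sketch of how a downstream stage might use these attributes, the following Jython Evaluator script assumes the scripting processor's standard records and output bindings and that header attributes are exposed through record.attributes; it is not taken from the product documentation:

    # Hedged sketch for a Jython Evaluator; assumes the default
    # 'records'/'output' bindings and the record.attributes map.
    for record in records:
        # The embedded Parquet schema travels with each record.
        schema = record.attributes.get('parquetSchema')

        # Union index attributes are present only when Skip Union
        # Indexes is disabled; the attribute suffix depends on the field.
        for name in record.attributes.keySet():
            if name.startswith('avro.union.typeIndex.'):
                index = record.attributes.get(name)
                # ... use the index as needed ...

        output.write(record)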

For a list of origins that read Parquet data, see Data Formats by Stage.

Writing Parquet Data

When writing Parquet data, destinations write an object or file for each partition and include the Parquet schema in every object or file.

Optimally, all batches are written to a single Parquet file. Parquet files generated by Data Collector have the following performance limitations:

  • The Local FS destination generates a single file with all batches, but with a small row group size, as illustrated in the sketch after this list.
  • All other destinations that generate Parquet files generate one file per batch.
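
The row group effect can be demonstrated outside Data Collector. This sketch, again assuming pyarrow and hypothetical file names, writes the same data with a small and a large row group size; files with many small row groups generally read less efficiently than files with a few large ones:

    # Illustrative sketch using pyarrow; not Data Collector code.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"n": list(range(100000))})

    # Many small row groups, as with the Local FS destination's output.
    pq.write_table(table, "small_groups.parquet", row_group_size=1000)

    # A single large row group for the same data.
    pq.write_table(table, "large_groups.parquet", row_group_size=100000)

    print(pq.ParquetFile("small_groups.parquet").num_row_groups)  # 100
    print(pq.ParquetFile("large_groups.parquet").num_row_groups)  # 1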

For a list of stages that write Parquet data, see Data Formats by Stage.