Parquet Data Format
Data Collector can read and write Parquet data.
Reading Parquet Data
When reading Parquet data, origins generate a record for every Parquet record in the file. The file must contain the Parquet schema, which the origin uses to generate the records.
Generated records include the Parquet schema in a parquetSchema record header attribute.
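As an illustration of the parquetSchema header attribute described above, here is a minimal sketch of a record as a downstream stage might see it. The record layout and schema text are assumptions for illustration only; the attribute name parquetSchema comes from the text.

```python
# Hypothetical shape of a generated record: the field values plus header
# attributes. Only the "parquetSchema" attribute name is taken from the
# documentation; everything else is illustrative.
record = {
    "value": {"id": 1, "name": "alice"},
    "header": {
        "parquetSchema": (
            "message user { required int64 id; "
            "required binary name (UTF8); }"
        ),
    },
}

# A downstream processor can inspect the schema from the header attribute
# without reopening the source file.
schema = record["header"]["parquetSchema"]
assert "required int64 id" in schema
```

The point is that the schema travels with each record, so later stages do not need access to the original Parquet file footer.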
When Skip Union Indexes is not enabled, the origin generates an avro.union.typeIndex./id record header attribute that identifies the index number of the union element that the data is read from. If a schema contains many unions and the pipeline does not depend on index information, you can enable Skip Union Indexes to avoid the long processing times associated with storing a large number of indexes.
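To make the union-index attribute concrete, here is a hypothetical sketch. The attribute naming pattern (avro.union.typeIndex./ followed by the field name) is taken from the text above; the union schema and the helper function are assumptions for illustration.

```python
# Illustrative Avro-style union for a field "id": index 0 is the "null"
# branch, index 1 is the "string" branch. (Schema is an assumption.)
union_branches = ["null", "string"]

def branch_for(record_header, field):
    # The attribute name follows the documented pattern:
    # avro.union.typeIndex./<field>
    idx = int(record_header["avro.union.typeIndex./" + field])
    return union_branches[idx]

# A record whose "id" value was read from union element 1 ("string").
header = {"avro.union.typeIndex./id": "1"}
assert branch_for(header, "id") == "string"
```

A pipeline that never calls on this branch information pays only the cost of storing such attributes, which is why Skip Union Indexes can help with schemas containing many unions.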
For a list of origins that read Parquet data, see Data Formats by Stage.
Writing Parquet Data
When writing Parquet data, destinations write an object or file for each partition and include the Parquet schema in every object or file.
Optimally, a single Parquet file contains all batches. Parquet files generated by Data Collector have the following performance limitations:
- The Local FS destination generates a single file with all batches, but with a small row group size.
- All other destinations that generate Parquet files generate one file per batch.
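The two behaviors above can be sketched as follows. This is a purely illustrative model, not Data Collector code; the batch contents and file names are assumptions.

```python
# Three illustrative batches of records.
batches = [["r1", "r2"], ["r3"], ["r4", "r5"]]

# Most destinations: one Parquet file (object) per batch.
per_batch_files = {f"part-{i}.parquet": b for i, b in enumerate(batches)}
assert len(per_batch_files) == 3  # three batches -> three files

# Local FS destination: a single file holding all batches, but each batch
# lands in its own small row group, which limits Parquet's columnar
# read efficiency compared to a few large row groups.
single_file_row_groups = list(batches)
assert len(single_file_row_groups) == 3  # one small row group per batch
```

Either way, the Parquet schema is included in every object or file that is written.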
For a list of stages that write Parquet data, see Data Formats by Stage.