Parquet Data Format
Data Collector can read and write Parquet data.
Reading Parquet Data
When reading Parquet data, origins generate a record for each Parquet record in the file. The file must contain the Parquet schema, which the origin uses to generate the records.
Generated records include the Parquet schema in a parquetSchema record header attribute.
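As an illustration of the embedded schema, the following sketch uses the pyarrow library, which is not part of Data Collector, to write a Parquet file and read the schema back from the file footer. It is this embedded schema that an origin surfaces in the parquetSchema attribute.

```python
# A minimal sketch, assuming the pyarrow library is installed (pip install pyarrow).
# pyarrow is not part of Data Collector; it is used here only to show that a
# Parquet file carries its own schema, which is what an origin reads and
# copies into the parquetSchema record header attribute.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
pq.write_table(table, "example.parquet")

# The schema lives in the file footer and can be read without scanning data.
print(pq.read_schema("example.parquet"))
# id: int64
# name: string
```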
When Skip Union Indexes is not enabled, the origin generates an avro.union.typeIndex./id record header attribute identifying the index number of the element in the union that the data is read from. If a schema contains many unions and the pipeline does not depend on index information, you can enable Skip Union Indexes to avoid the long processing times associated with storing a large number of indexes.
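The index is simply the position, within the union's list of types, of the branch that the value was read as. The following pure-Python sketch illustrates that bookkeeping; the union schema and function here are hypothetical illustrations, not Data Collector APIs.

```python
# Hypothetical illustration of what a union type index means; this is not
# Data Collector code. For an Avro-style union ["null", "string", "long"],
# the index identifies which branch the value was read as.
UNION = ["null", "string", "long"]

def union_type_index(value):
    """Return the index of the union branch that the value matches."""
    if value is None:
        return UNION.index("null")    # 0
    if isinstance(value, str):
        return UNION.index("string")  # 1
    if isinstance(value, int):
        return UNION.index("long")    # 2
    raise TypeError("value matches no branch of the union")

print(union_type_index(None))    # 0
print(union_type_index("text"))  # 1
print(union_type_index(42))      # 2
```

For a field /id read from such a union, the matched index is what the origin stores in the avro.union.typeIndex./id attribute; enabling Skip Union Indexes skips this bookkeeping.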
For a list of origins that read Parquet data, see Data Formats by Stage.
Writing Parquet Data
When writing Parquet data, destinations write an object or file for each partition and include the Parquet schema in every object or file.
Ideally, a Parquet file contains all batches in a single file, written with large row groups. Parquet files generated by Data Collector have the following performance limitations, illustrated in the sketch after this list:
- The Local FS destination generates a single file that contains all batches, but with a small row group size.
- All other destinations that generate Parquet files generate one file per batch.
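The row group layout matters because Parquet readers fetch data and parallelize work per row group, so many tiny row groups, or many single-batch files, lose much of the format's efficiency. The following sketch uses pyarrow, again outside Data Collector, to inspect the row group layout of a generated file.

```python
# A minimal sketch, assuming pyarrow is installed; it mimics the small-row-group
# layout described above and inspects the result. The file name is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(10))})

# Force small row groups (2 rows each) instead of one large row group.
pq.write_table(table, "small_groups.parquet", row_group_size=2)

meta = pq.ParquetFile("small_groups.parquet").metadata
print(meta.num_row_groups)         # 5 row groups for only 10 rows
print(meta.row_group(0).num_rows)  # 2
```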
For a list of stages that write Parquet data, see Data Formats by Stage.