Whole File Transformer

Supported pipeline types:
  • Data Collector

The Whole File Transformer processor transforms fully written Avro files to highly efficient, columnar Parquet files. Use the Whole File Transformer in a pipeline that reads Avro files as whole files and writes the transformed Parquet files as whole files.

Origins and destinations that support whole files include cloud storage stages such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, as well as local and remote file systems such as Local FS, SFTP/FTP/FTPS Client, and Hadoop FS. For a full list of whole file origins and destinations, or for more information about the whole file data format, see Whole File Data Format.

The Whole File Transformer converts Avro files to Parquet within a pipeline. If a Hadoop cluster is available, you can use the MapReduce executor for the conversion instead. The MapReduce executor delegates the conversion task to the Hadoop cluster. For a case study on capturing data drift and producing Parquet files using the MapReduce executor, see Parquet Case Study.

When converting Avro files to Parquet, the Whole File Transformer performs the conversion in memory, then writes a temporary Parquet file to a local directory on the Data Collector machine. Ensure that Data Collector has the memory and storage necessary to perform this processing.
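
As a rough illustration of the storage requirement, the following Python sketch checks that a local temporary directory has enough free space before a conversion starts. It is not part of Data Collector. The function name `has_room_for_temp_file`, the `headroom` factor, and the idea of sizing the check from the Avro file size are all assumptions made for this example.

```python
import shutil


def has_room_for_temp_file(temp_dir: str, avro_size_bytes: int,
                           headroom: float = 1.5) -> bool:
    """Return True if temp_dir has at least avro_size_bytes * headroom free.

    Hypothetical pre-flight check. The headroom factor is an assumption:
    the size of the Parquet output depends on the data and the codec used.
    """
    if avro_size_bytes < 0:
        raise ValueError("avro_size_bytes must be non-negative")
    required = int(avro_size_bytes * headroom)
    # shutil.disk_usage reports total, used, and free bytes for the
    # file system that contains temp_dir.
    free = shutil.disk_usage(temp_dir).free
    return free >= required
```

For example, `has_room_for_temp_file(tempfile.gettempdir(), 256 * 1024 * 1024)` reports whether the system temporary directory could hold the temporary output for a 256 MB Avro file under the assumed headroom. A check like this covers storage only; memory must be sized separately.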

When you configure the Whole File Transformer, you specify the local directory to use. You can configure a prefix and suffix for the resulting Parquet files, as well as the buffer size and rate at which to process the Avro files. You can also configure standard Parquet properties, such as a compression codec, row group size, and page size.
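
The settings described above can be pictured as a small configuration object. The Python sketch below is illustrative only: the class `TransformerSettings`, its field names, and its default values are hypothetical and do not reflect the actual property names or defaults in Data Collector. It also assumes that the prefix and suffix wrap the original Avro file name.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerSettings:
    """Hypothetical stand-in for the processor settings (illustrative names)."""
    temp_dir: str                                   # local directory for temporary Parquet files
    files_prefix: str = ""                          # added before the file name
    files_suffix: str = ""                          # added after the file name
    buffer_size_bytes: int = 8192                   # assumed value, not a product default
    compression_codec: str = "SNAPPY"               # assumed value
    row_group_size_bytes: int = 128 * 1024 * 1024   # assumed value
    page_size_bytes: int = 1024 * 1024              # assumed value

    def __post_init__(self) -> None:
        # Reject sizes that cannot describe a usable buffer, row group, or page.
        for name in ("buffer_size_bytes", "row_group_size_bytes", "page_size_bytes"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


def parquet_file_name(avro_file_name: str, settings: TransformerSettings) -> str:
    """Build the output name, assuming prefix and suffix wrap the original name."""
    return f"{settings.files_prefix}{avro_file_name}{settings.files_suffix}"


def temp_parquet_path(avro_file_name: str, settings: TransformerSettings) -> str:
    """Join the configured local directory with the generated file name."""
    return os.path.join(settings.temp_dir, parquet_file_name(avro_file_name, settings))
```

With `files_prefix="converted_"` and `files_suffix=".parquet"`, the sketch maps `sales.avro` to `converted_sales.avro.parquet`. Whether the processor keeps the original extension in this way is part of the assumption, so verify the naming against actual output before relying on it.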

Typically, when using the Whole File Transformer processor, you use a separate pipeline to process data and generate the Avro files to be converted. However, if you have Avro files generated by a third party, you can simply create a pipeline that converts those files to Parquet.