Whole File Transformer

The Whole File Transformer processor transforms fully written Avro files to highly efficient, columnar Parquet files. Use the Whole File Transformer in a pipeline that reads Avro files as whole files and writes the transformed Parquet files as whole files.

Origins and destinations that support whole files include cloud storage stages such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage, as well as local and remote file systems such as Local FS, SFTP/FTP/FTPS Client, and Hadoop FS. For a full list of whole file origins and destinations or more information about the whole file data format, see Whole File Data Format.

You can use the Whole File Transformer to convert Avro files to Parquet within a pipeline. If a Hadoop cluster is available, you can use the MapReduce executor to convert Avro files to Parquet instead of the Whole File Transformer. The MapReduce executor delegates the conversion task to the Hadoop cluster. For a case study on capturing data drift and producing Parquet files using the MapReduce executor, see Parquet Case Study.

The Whole File Transformer converts Avro files to Parquet in memory, then writes a temporary Parquet file to a local directory on the Data Collector machine. Ensure that Data Collector has the necessary memory and storage to perform this processing.

When you configure the Whole File Transformer, you specify the local directory to use. You can configure a prefix and suffix for the resulting Parquet files, as well as the buffer size and transfer rate used to process the Avro files. You can also configure standard Parquet properties, such as the compression codec, row group size, and page size.

Typically, when using the Whole File Transformer processor, you use a separate pipeline to process data and generate the Avro files to be converted. But if you already have Avro files generated by a third party, you need only the pipeline that converts the files to Parquet.

Implementation Overview

When using the Whole File Transformer processor, you will typically configure two pipelines. The pipeline that includes the Whole File Transformer must be a whole file pipeline, which does not permit additional processing of file data. As a result, any processing that you require must take place in a separate pipeline.

This separate pipeline includes an origin to read data of any data format, the processors that you need, and a file-based destination to write Avro files. Let's call this pipeline the Write Avro pipeline. If you already have Avro files ready, you can skip this pipeline.

Then, a second pipeline reads the Avro files using the whole file data format, transforms the files to Parquet using the Whole File Transformer, then writes the Parquet files as whole files. Let's call this the Parquet Conversion pipeline.

Together, the implementation looks like this:

Write Avro Pipeline

The Write Avro pipeline reads data from the origin, performs any required processing, and uses a file-based Avro destination such as Local FS to write Avro files to a file system.

The destination file system acts as a staging area. As the Write Avro pipeline closes each Avro file, the Parquet Conversion pipeline can begin processing the file and convert it to Parquet.

Since each Avro file is ultimately transformed to a corresponding Parquet file, the Avro files should be of substantial size to take advantage of Parquet capabilities. A file size of 2 GB is recommended.

The Write Avro pipeline includes any origin, any number of processors, and an Avro-enabled file-based destination, as follows:

For a list of destinations that support writing Avro, see Data Format Support.

Parquet Conversion Pipeline

The Parquet Conversion pipeline processes the closed Avro files as whole files, leveraging the efficiency of the whole file data format. This pipeline converts the Avro files to Parquet using the Whole File Transformer processor.

The Parquet Conversion pipeline must use a file-based origin enabled for whole files, such as the Directory origin, to process the large Avro files.

The Whole File Transformer processor converts each Avro file to Parquet, writing each Parquet file temporarily to a user-defined Data Collector directory. Then a whole file destination, such as Google Cloud Storage, moves the Parquet file to the destination system.
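To make the memory and storage requirements concrete, the conversion is conceptually similar to reading an Avro file record by record and writing the records to a Parquet file, as the Apache parquet-avro library does. The following Java sketch is for illustration only, is not the processor's implementation, and uses hypothetical file paths:

    // Illustrative only: roughly the kind of work the conversion involves,
    // shown with the Apache parquet-avro library. File paths are hypothetical.
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    import java.io.File;

    public class AvroToParquetSketch {
        public static void main(String[] args) throws Exception {
            File avroFile = new File("/avro2parquet/avro_example.avro");

            // Read the Avro file and reuse its schema for the Parquet writer.
            try (DataFileReader<GenericRecord> reader =
                         new DataFileReader<>(avroFile, new GenericDatumReader<>());
                 ParquetWriter<GenericRecord> writer = AvroParquetWriter
                         .<GenericRecord>builder(new Path("/tmp/out/.parquet/avro_example.parquet"))
                         .withSchema(reader.getSchema())
                         .build()) {
                for (GenericRecord record : reader) {
                    writer.write(record); // every record is rewritten into the columnar Parquet layout
                }
            }
        }
    }

In the pipeline itself, the processor handles all of this work; you only configure the temporary file directory and the related Parquet properties.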

Whole files can only be processed by the Whole File Transformer, so this pipeline performs no additional processing.

As a result, the Parquet Conversion pipeline looks like this:

For a step-by-step review of creating these pipelines, see the implementation example.

For a full list of whole file origins and destinations, see Data Format Support.

Memory and Storage Requirements

The Whole File Transformer converts Avro files to Parquet in memory, then writes a temporary Parquet file to a local directory on the Data Collector machine. Ensure that Data Collector has the necessary memory and storage to perform this processing.

The available memory is determined by the Data Collector Java heap size. The available storage is the available disk space on the Data Collector machine. Pipelines using the Whole File Transformer processor require available memory and storage based on the size of the largest Avro file to be converted. For more information, see Java Heap Size in the Data Collector documentation.

For example, when converting files that are up to 2 GB in size, ensure that Data Collector has at least 2 GB of available memory and storage when the pipeline runs.
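If more heap is needed, one common approach is to increase the Data Collector Java heap size. As a minimal sketch, assuming a default tarball or RPM installation where the heap is set through the SDC_JAVA_OPTS environment variable in sdc-env.sh (or sdcd-env.sh when Data Collector runs as a service), the values below are examples only and leave headroom beyond the 2 GB conversion:

    # sdc-env.sh - illustrative values only; leaves headroom beyond the 2 GB conversion
    export SDC_JAVA_OPTS="-Xmx4096m -Xms4096m ${SDC_JAVA_OPTS}"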

Generated Record

When processing whole files, the Whole File Transformer processor updates the file information in the whole file record and adds a record header attribute.

Whole file records include fields that contain file reference information. The record does not include the actual data of the file being streamed.

The Whole File Transformer processor moves the existing information in the fileInfo map field, generated by the pipeline origin, to a new record header attribute named sourceFileInfo. It then updates the fileInfo field with information for the temporary Parquet file, such as the updated path and file location.
Tip: You can use data preview to determine the information and field names that are included in the fileInfo field. The information and field names can differ based on the origin system.
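For illustration only, with hypothetical field names and values (as the tip notes, the actual names depend on the origin system), a whole file record read by a Directory origin might change as follows:

    Before the Whole File Transformer:
      /fileRef       - reference to the closed Avro file
      /fileInfo/file - /avro2parquet/avro_sdc-1234.avro
      /fileInfo/size - 2147483648

    After the Whole File Transformer:
      /fileRef       - reference to the temporary Parquet file
      /fileInfo/file - /tmp/out/.parquet/.avro_to_parquet_tmp_conversion_sdc-1234.parquet
      /fileInfo/size - size of the temporary Parquet file
      sourceFileInfo - record header attribute holding the original fileInfo values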

Implementation Example: Amazon S3 Parquet Conversion

Let's say we have several thousand small objects in Amazon S3. We want to convert the numerous small objects to large Parquet objects so we can more quickly analyze the data.

To do this, we set up two pipelines. The first pipeline reads the Amazon S3 objects and writes large Avro files to a local file system. This is the Write Avro pipeline. The second Parquet Conversion pipeline reads the Avro files as whole files and transforms each file to a corresponding Parquet file, which is written back to S3.

The resulting workflow looks like this:

Write Avro Pipeline

The Write Avro pipeline can be pretty simple - it just needs to write large Avro files. You can make it more complex by adding processing, but at its most basic, all you need is an origin and an Avro-enabled file-based destination to write the Avro files.

The Write Avro pipeline uses the Amazon S3 origin to read the multitude of small S3 objects. Then, it writes Avro files to a local file system using the Local FS destination, as follows:

Note the following configuration details:

Amazon S3 origin
The Amazon S3 origin requires no special configuration. We simply configure it to access the objects to be processed and specify the data format. Configure error handling, post-processing, and other properties as needed.
Local FS destination
In the Local FS destination, we carefully configure the following properties:
  • Directory Template - This property determines where the Avro files are written. The directory template that we use determines where the origin of the second pipeline picks up the Avro files.

    We'll use /avro2parquet/.

  • File Prefix - This is an optional prefix for the output file name.

    When writing to a file, the Local FS destination creates a temporary file that uses tmp_ as a file name prefix. To ensure that the second pipeline picks up only fully written output files, we define a file prefix for the output files. It can be something simple, like avro_.

  • Max File Size - The Avro files are converted to Parquet files in the second pipeline, so we want to ensure that these output files are fairly large.
    Let's say we want the Parquet files to be roughly 3 GB in size, so we set Max File Size to 3 GB.
    Important: To transform the Avro files, the Whole File Transformer requires Data Collector memory and storage equivalent to the maximum file size. Consider this requirement when setting the maximum file size.
  • Data Format - We select Avro as the data format, and specify the location of the Avro schema to be used.

Parquet Conversion Pipeline

The Parquet Conversion pipeline is simple as well. It uses the whole file data format, so no processing of file data can occur except the conversion of the Avro files to Parquet.

We use the Directory origin and whole file data format to read the fully written Avro files. Then comes the Whole File Transformer processor for the Parquet conversion, and the Amazon S3 destination to write the Parquet files back to S3 as whole files.

The Parquet Conversion pipeline looks like this:

Note the following configuration details:

Directory origin
In the Directory origin, we carefully configure the following properties:
  • Files Directory - To read the files written by the first pipeline, point this to the directory used in the Local FS Directory Template property: /avro2parquet/.
  • File Name Pattern - To pick up all output files, use a glob for file name pattern: avro_*.

    We use the avro_ prefix in the pattern to avoid reading the active temporary files generated by the Local FS destination, as illustrated in the sketch after this list.

  • Read Order - To read the files in the order they are written, we use Last Modified Timestamp.

  • Data Format - We use Whole File to stream the data to the Whole File Transformer.
We configure the other properties as needed.
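As a quick illustration of how the avro_* pattern separates finished output files from in-progress temporary files, the following Java sketch uses the JDK glob matcher, which treats a simple pattern like this the same way. The file names are hypothetical:

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.PathMatcher;
    import java.nio.file.Paths;

    public class FileNamePatternSketch {
        public static void main(String[] args) {
            // The same glob pattern configured for File Name Pattern in the Directory origin.
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:avro_*");

            Path finished = Paths.get("avro_sdc-1234.avro"); // fully written output file
            Path temporary = Paths.get("tmp_sdc-1234.avro"); // file still being written

            System.out.println(matcher.matches(finished));   // true  - picked up by the origin
            System.out.println(matcher.matches(temporary));  // false - skipped
        }
    }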
Whole File Transformer processor
The Whole File Transformer processor converts Avro files to Parquet in memory, then writes temporary Parquet files to a local directory. These whole files are then streamed from the local directory to Amazon S3.
When using the Whole File Transformer, we must ensure that the Data Collector machine has the required memory and storage available. Since we set the maximum file size in the Write Avro pipeline to 3 GB, we must ensure that Data Collector has 3 GB of available memory and storage before running this pipeline.
In the Whole File Transformer, we also have the following properties to configure:
  • Job Type - We're using the Avro to Parquet job.
  • Temporary File Directory - This is a directory local to Data Collector for the temporary Parquet files. We can use the default, /tmp/out/.parquet.
We can configure other temporary Parquet file properties and Parquet conversion properties as well, but the defaults are fine in this case.
Amazon S3 destination
The Amazon S3 destination streams the temporary Parquet files from the Whole File Transformer temporary file directory to Amazon S3.
We configure this stage to write to Amazon S3, and select the Whole File data format.

Runtime

We start both pipelines at the same time.

The Write Avro pipeline processes Amazon S3 objects and writes 3 GB Avro files to the /avro2parquet/ directory. The output files are named avro_<file name>.

After the first file is written, the Parquet Conversion pipeline picks up the file. It streams the file from the Directory origin to the Whole File Transformer processor. The Whole File Transformer converts the Avro file to Parquet in memory, and writes it to the specified temporary directory, /tmp/out/.parquet. The temporary file then streams from the Whole File Transformer to the Amazon S3 location defined in the destination.

In the end, the multitude of small Amazon S3 objects is converted into large, columnar Parquet files and streamed back to Amazon S3 for easy analysis. Two simple pipelines, one Whole File Transformer, and you're done!

Configuring a Whole File Transformer Processor

Configure a Whole File Transformer processor to convert Avro whole files to Parquet.
  1. In the Properties panel, on the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
    • Required Fields - Fields that must include data for the record to be passed into the stage.
      Tip: You might include fields that the stage uses.

      Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    • Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

      Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    • On Record Error - Error record handling for the stage:
      • Discard - Discards the record.
      • Send to Error - Sends the record to the pipeline for error handling.
      • Stop Pipeline - Stops the pipeline.
  2. On the Job tab, configure the following properties:
    • Job Type - Job type. Use the Avro to Parquet job.
    • Temporary File Directory - A directory local to Data Collector where each temporary Parquet file resides until it is streamed to the destination.

      You can use constants and record functions when defining the directory.

      Ensure that the Data Collector machine has the storage required for processing.

    • Files Prefix - File name prefix for the temporary Parquet files.

      Default is .avro_to_parquet_tmp_conversion_.

    • Files Suffix - File name suffix for the temporary Parquet files.

      Default is .parquet.

    • Buffer Size (bytes) - Size of the buffer used to transfer data.
    • Rate per Second - Transfer rate to use.

      Enter a number to specify a rate in bytes per second. Use an expression to specify a rate that uses a different unit of measure per second, such as ${5 * MB}. Use -1 to opt out of this property.

      By default, the processor does not use a transfer rate.

  3. On the Avro to Parquet tab, configure the following properties:
    • Compression Codec - Compression codec to use. If you do not enter a compression codec, the processor uses the default compression codec for Parquet.
    • Row Group Size - Parquet row group size. Use -1 to use the Parquet default.
    • Page Size - Parquet page size. Use -1 to use the Parquet default.
    • Dictionary Page Size - Parquet dictionary page size. Use -1 to use the Parquet default.
    • Max Padding Size - Parquet maximum padding size. Use -1 to use the Parquet default.
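These properties map to standard Parquet writer settings. As a rough illustration of what they control, the following Java sketch sets the equivalent options on an Apache parquet-avro writer. This is a hedged sketch, under the assumption that the builder options correspond to the properties above, not a description of the processor's implementation; the schema, path, and values are hypothetical:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetTuningSketch {
        public static void main(String[] args) throws Exception {
            Schema schema = SchemaBuilder.record("example").fields()
                    .requiredString("name")
                    .endRecord();

            // Each builder option mirrors a property on the Avro to Parquet tab.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("/tmp/out/.parquet/tuned.parquet"))
                    .withSchema(schema)
                    .withCompressionCodec(CompressionCodecName.SNAPPY) // Compression Codec
                    .withRowGroupSize(128 * 1024 * 1024)               // Row Group Size
                    .withPageSize(1024 * 1024)                         // Page Size
                    .withDictionaryPageSize(1024 * 1024)               // Dictionary Page Size
                    .withMaxPaddingSize(8 * 1024 * 1024)               // Max Padding Size
                    .build()) {
                // Records would be written here.
            }
        }
    }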