File

The File destination writes files to Hadoop Distributed File System (HDFS) or a local file system. The File destination writes data based on the specified data format and creates a separate file for every partition.

The File destination writes to HDFS using connection information stored in a Hadoop configuration file.

You specify the output directory and write mode to use. You can configure the destination to group partition files by field values. If you configure the destination to overwrite related partitions, you must configure Spark to overwrite partitions dynamically. You can also configure the destination to drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.

You select the data format to write and configure related properties.

You can also specify HDFS configuration properties for a HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.

Directory Path

The File destination writes files to a directory in HDFS or a local file system.

To specify the directory, enter the path to the directory. The format of the directory path depends on the file system that you want to write to:

HDFS: To write files to HDFS, use the following format for the directory path:; hdfs://<authority>/<path>; For example, to write to the /user/hadoop/files directory on HDFS, enter the following path:; hdfs://nameservice/user/hadoop/files
Local file system: To write files to a local file system, use the following format for the directory path:; file:///<directory>; For example, to write to the /Users/transformer/output directory on the local file system, enter the following path:; file:///Users/transformer/output

The destination creates the directory if it doesn't exist. The user that Transformer uses to launch the Spark application must have write access to the root of the specified directory path. For a cluster pipeline, the user that launches the Spark application depends on the cluster type configured for the pipeline. For a local pipeline, the user that launches the Spark application is the user that runs the Transformer process.

Write Mode

The write mode determines how the File destination writes files to the destination system. When writing files, the resulting file names are based on the data format of the files.

The File destination includes the following write modes:

Overwrite files

Before writing data from a batch, the destination removes all files from the output directory and any subdirectories.

Overwrite related partitions

Before writing data from a batch, the destination removes the files from subdirectories for which the batch has data. The destination leaves subdirectories intact if the batch has no data for that subdirectory.

For example, say you have a directory with ten subdirectories. If the processed data belongs in two subdirectories, the destination overwrites the two subdirectories with the new data. The other eight subdirectories remain unchanged.

Use this mode to overwrite only affected files, like when writing to a slowly changing grouped file dimension.

The Partition by Fields property specifies the fields used to create subdirectories. With no specified fields, the destination creates no subdirectories and this mode works the same as the Overwrite Files mode.

Note: This mode requires you to configure Spark to overwrite partitions dynamically.

Write new files to new directory

When the pipeline starts, the destination creates the output directory and writes new files to that directory. Before writing data for each subsequent batch during the pipeline run, the destination removes all files from the directory and any subdirectories. The destination generates an error if the output directory exists when the pipeline starts.

Write new or append to existing files

The destination creates new files if the files do not exist, or appends data to existing files if the files already exist in the output directory or subdirectory.

Spark Requirement to Overwrite Related Partitions

If you set Write Mode to Overwrite Related Partitions, you must configure Spark to overwrite partitions dynamically.

Set the Spark configuration property spark.sql.sources.partitionOverwriteMode to dynamic.

You can configure the property in Spark, or you can configure the property in individual pipelines. For example, you might set the property to dynamic in Spark when you plan to enable the Overwrite Related Partitions mode in most of your pipelines, and then set the property to static in individual pipelines that do not use that mode.

To configure the property in an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.

Partition Files

Pipelines process data in partitions. The File destination writes files, which contain the processed data, in the configured output directory. The destination writes one file for each partition.

The destination groups the partition files if you specify one or more fields in the Partition by Fields property. For each unique value of the fields specified in the Partition by Fields property, the destination creates a subdirectory. In each subdirectory, the destination writes one file for each partition that has the corresponding field and value. Because the subdirectory name includes the field name and value, the file omits that data. With grouped partition files, you can more easily find data with certain values, such as all the data for a particular city.

To overwrite subdirectories that have updated data and leave other subdirectories intact, set the Write Mode property to Overwrite Related Partitions. Then, the destination clears affected subdirectories before writing, replacing files in those subdirectories with new files.

If the Partition by Fields property lists no fields, the destination does not group partition files. Instead, the destination writes one file for each partition directly in the output directory.

Because the text data format only contains data from one field, the destination does not group partition files for the text data format. Do not configure the Partition by Fields property if the destination writes in the text data format.

Example: Grouping Partition Files

Suppose your pipeline processes orders. You want to write data in the Sales directory and group the data by cities. Therefore, in the Directory Path property, you enter Sales, and in the Partition by Fields property, you enter City. Your pipeline is a batch pipeline that only processes one batch. You want to overwrite the entire Sales directory each time you run the pipeline. Therefore, in the Write Mode property, you select Overwrite Files.

As the pipeline processes the batch, the origins and processors lead Spark to split the data into three partitions. The batch contains two values in the City field: Oakland and Fremont. Before writing the processed data in the Sales directory, the destination removes any existing files and subdirectories and then creates two subdirectories, City=Oakland and City=Fremont. In each subdirectory, the destination writes one file for each partition that contains data for that city, as shown below:

Note that the written files do not include the City field; instead, you infer the city from the subdirectory name. The first partition does not include any data for Fremont; therefore, the City=Fremont subdirectory does not contain a file from the first partition. Similarly, the third partition does not include any data for Oakland; therefore, the City=Oakland subdirectory does not contain a file from the third partition.

Data Formats

The File destination writes records based on the specified data format.

The destination can write using the following data formats:

Avro

The destination writes an Avro file for each partition and includes the Avro schema in each file.