File Dimension Pipeline

Because a record in a file cannot be updated in place, the files of a file dimension must be overwritten. To ensure that new dimension files contain master data as well as change data, master data must be passed through the pipeline along with change data.

A file dimension might have a few unpartitioned files or a large set of partitioned files, like a set of ORC or Parquet files. You configure a file dimension pipeline a bit differently, depending on whether files are partitioned.

Here's how to configure a file dimension pipeline:
Pipeline
For a partitioned file dimension, configure the following properties:
  1. Enable Spark to overwrite partitions dynamically. This allows the destination to overwrite only the partitions that contain changes.

    For more information, see Partitioned File Dimension Prerequisite.

  2. On the General tab of the pipeline properties panel, enable ludicrous mode to avoid reading master data that is not related to the change data, thereby improving pipeline performance.
With this configuration, if change data includes five records in two partitions, the master origin reads only those two partitions of the master data, and the destination overwrites only the files in those two partitions.
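The partition-limited read can be illustrated with a minimal Python sketch. This is not Transformer code. The function names, the `region` partition field, and the in-memory dictionary standing in for partitioned files are all assumptions made for illustration.

```python
def partitions_in(change_records, partition_field):
    """Return the set of partition values touched by the change records."""
    return {rec[partition_field] for rec in change_records}


def read_related_master(master_by_partition, change_records, partition_field):
    """Read only the master partitions that the change data touches."""
    touched = partitions_in(change_records, partition_field)
    return {p: master_by_partition[p] for p in touched if p in master_by_partition}


# Hypothetical master dimension, keyed by partition value.
master_by_partition = {
    "US": [{"id": 1, "region": "US", "name": "Ann"}],
    "EU": [{"id": 2, "region": "EU", "name": "Bo"}],
    "APAC": [{"id": 3, "region": "APAC", "name": "Cy"}],
}

# Five change records that fall into two partitions.
change_records = [
    {"id": 1, "region": "US", "name": "Anne"},
    {"id": 4, "region": "US", "name": "Dee"},
    {"id": 5, "region": "US", "name": "Eli"},
    {"id": 2, "region": "EU", "name": "Bob"},
    {"id": 6, "region": "EU", "name": "Flo"},
]

related = read_related_master(master_by_partition, change_records, "region")
print(sorted(related))  # ['EU', 'US']
```

The APAC partition is never read because no change record refers to it.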
For an unpartitioned file dimension, all dimension data must be read and written, so no special pipeline properties are required. Each time a batch of change data is read, the master origin must read the entire dimension. When writing the changes, the destination must overwrite the entire dimension.
Origins
Configure the master origin, the Whole Directory origin, to read the master dimension data. Configure a change origin to read the change data. Then, connect them to the Slowly Changing Dimension processor.
When connected to the processor, each time the change origin reads a batch of data, the master origin reads the dimension data.
Unlike most origins, the Whole Directory origin does not cache data or track offsets, so it can read all of the master dimension data each time the change origin reads a batch of data. This ensures that comparisons are made against the latest dimension data.
Whether the master origin reads the entire dimension or just the related master records depends on whether ludicrous mode is enabled at the pipeline level.
Processor
When both sets of data pass to the Slowly Changing Dimension processor, the processor compares change records with master records, then passes records flagged for insert or update downstream.
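As a rough illustration of this comparison, the following Python sketch flags change records for insert or update by looking them up in the master data by key. It is a simplified Type 1 style comparison, not the processor's actual logic. The function name and sample fields are hypothetical, and dropping unchanged records is an assumption of this sketch.

```python
def flag_changes(master_records, change_records, key_fields):
    """Flag each change record as 'insert' or 'update' by key lookup."""

    def key_of(rec):
        return tuple(rec[f] for f in key_fields)

    # Index master records by key for constant-time lookups.
    master_index = {key_of(rec): rec for rec in master_records}

    flagged = []
    for rec in change_records:
        existing = master_index.get(key_of(rec))
        if existing is None:
            flagged.append(("insert", rec))
        elif existing != rec:
            flagged.append(("update", rec))
        # Assumption: records identical to the master need no action.
    return flagged


master = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
changes = [
    {"id": 1, "name": "Anne"},  # key exists, data differs
    {"id": 3, "name": "Cy"},    # key not in master
    {"id": 2, "name": "Bo"},    # identical to master
]

flagged = flag_changes(master, changes, ["id"])
print([op for op, _ in flagged])  # ['update', 'insert']
```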
Configure the following properties in the processor:
  1. Ensure that the master origin is connected to the master data input and the change origin to the change data input for the processor.

    If they are connected to the wrong locations, you can easily reverse the connections by clicking the Change Input Order link on the General tab of the processor.

  2. To determine how records are evaluated for insert or update, specify the SCD Type and related properties.
  3. List the key fields used to match change records with master records.
    Note: When processing partitioned dimension files, list the partition fields after the key fields.
  4. For Type 2 dimensions, specify one or more tracking field names and types. For Type 1 dimensions, this is optional.
  5. Enable the Output Full Master Data property so the master data is passed to the destination along with the change data.

    The destination can then determine whether to write the entire master data set or a subset of the master data to the dimension.

  6. Optionally configure other properties, such as whether to replace null values with data from the latest master record and the action to take when change data includes additional fields.
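To make the roles of key fields, tracking fields, and full master output more concrete, here is a small Python sketch of a Type 2 style update. It assumes a single hypothetical tracking field named `version` that increments per key, and it passes all master records downstream together with the new versions. The processor's real behavior and tracking types may differ.

```python
def apply_type2(master_records, change_records, key_fields, version_field="version"):
    """Return master records plus a new, versioned record for each change."""

    def key_of(rec):
        return tuple(rec[f] for f in key_fields)

    # Highest version currently stored for each key.
    latest = {}
    for rec in master_records:
        k = key_of(rec)
        latest[k] = max(latest.get(k, 0), rec[version_field])

    # Pass the full master data through, as copies.
    output = [dict(rec) for rec in master_records]

    # Each change becomes a new record with the next version number.
    for rec in change_records:
        k = key_of(rec)
        latest[k] = latest.get(k, 0) + 1
        output.append({**rec, version_field: latest[k]})
    return output


master = [{"id": 1, "name": "Ann", "version": 1}]
changes = [{"id": 1, "name": "Anne"}, {"id": 2, "name": "Bo"}]

result = apply_type2(master, changes, key_fields=["id"])
for row in result:
    print(row)
```

In this sketch the earlier version of a record is kept rather than replaced, which is what distinguishes a Type 2 dimension from a Type 1 dimension.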
Destination
Configure a dimension destination to write to the master dimension.
For a partitioned file dimension, configure the following properties:
  1. On the primary tab of the destination, such as the File tab for the File destination, select the Exclude Unrelated SCD Master Records property.

    This filters out master records that are not related to the change records.

  2. On the same tab, set the Write Mode property to Overwrite Related Partitions.
With this configuration, the destination overwrites only partitions related to the change records, leaving unchanged partitions as is.
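The effect of these two properties can be sketched in plain Python. The sketch assumes that "related" means "in a partition touched by a change record", uses a dictionary in place of partitioned files, and appends change records instead of resolving updates. All names are hypothetical.

```python
def overwrite_related_partitions(stored, master_records, change_records, partition_field):
    """Rewrite only the partitions that the change records touch."""
    touched = {rec[partition_field] for rec in change_records}

    # Exclude master records from partitions that no change record touches.
    related_master = [r for r in master_records if r[partition_field] in touched]

    # Untouched partitions are carried over as is.
    result = dict(stored)
    for part in touched:
        result[part] = [
            r for r in related_master + change_records if r[partition_field] == part
        ]
    return result


us = {"id": 1, "region": "US", "name": "Ann"}
eu = {"id": 2, "region": "EU", "name": "Bo"}
apac = {"id": 3, "region": "APAC", "name": "Cy"}

stored = {"US": [us], "EU": [eu], "APAC": [apac]}
change_records = [{"id": 4, "region": "US", "name": "Dee"}]

result = overwrite_related_partitions(stored, [us, eu, apac], change_records, "region")
print(len(result["US"]))                  # 2
print(result["EU"] is stored["EU"])       # True
print(result["APAC"] is stored["APAC"])   # True
```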
For an unpartitioned file dimension:
  1. On the primary tab of the destination, such as the File tab for the File destination, set the Write Mode property to Overwrite Files.
  2. Do not enable the Exclude Unrelated SCD Master Records property.
With this configuration, the destination deletes all existing master dimension files and writes a new master dimension file that contains all master data with the latest changes.
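For comparison, a full overwrite can be sketched with the Python standard library. The directory layout, the `dimension.json` file name, and the JSON Lines format are assumptions for illustration only, not what the destination actually writes.

```python
import json
import tempfile
from pathlib import Path


def overwrite_files(dimension_dir, records, file_name="dimension.json"):
    """Delete every existing file in the directory, then write one new file."""
    dimension_dir = Path(dimension_dir)
    dimension_dir.mkdir(parents=True, exist_ok=True)

    # Remove all existing dimension files.
    for old_file in dimension_dir.iterdir():
        if old_file.is_file():
            old_file.unlink()

    # Write the full master data, including the latest changes.
    new_file = dimension_dir / file_name
    with new_file.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return new_file


with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical existing dimension files.
    (Path(tmp) / "part-0001.json").write_text('{"id": 1, "name": "Ann"}\n', encoding="utf-8")
    (Path(tmp) / "part-0002.json").write_text('{"id": 2, "name": "Bo"}\n', encoding="utf-8")

    overwrite_files(tmp, [{"id": 1, "name": "Anne"}, {"id": 2, "name": "Bo"}])
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(remaining)  # ['dimension.json']
```

Because every record is rewritten on each batch, this approach suits small dimensions better than large ones.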