For a file dimension, dimension files must be
overwritten because a record in a file cannot be updated in place. To ensure that the new
dimension files contain master data as well as change data, master data must pass
through the pipeline along with the change data.
A file dimension might have a few unpartitioned files or a large set of partitioned
files, such as a set of ORC or Parquet files. You configure a file dimension pipeline
slightly differently depending on whether the files are partitioned.
Here's how to configure a file dimension pipeline:
- Pipeline
- For a partitioned file dimension, configure the following properties:
- Enable Spark to overwrite partitions dynamically. This allows the
destination to overwrite only the partitioned files with changes.
For more information, see Partitioned File Dimension Prerequisite.
- On the General tab of the pipeline properties panel, enable ludicrous
mode to avoid reading master data that is not related to
the change data, thereby improving pipeline performance.
- With this configuration, if change data includes five records in two
partitions, the master origin reads only those two partitions of the
master data, and the destination overwrites only the files in those two
partitions.
- For an unpartitioned file dimension, all dimension data must be read and
written, so no special pipeline properties are required. Each time a batch
of change data is read, the master origin must read the entire dimension.
When writing the changes, the destination must overwrite the entire
dimension.
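The partition-limited read described above can be pictured with a minimal sketch in plain Python. It is not the product's implementation: the function names, the `region` partition field, and the sample records are all hypothetical, and the master data is modeled as an in-memory dictionary keyed by partition values rather than as ORC or Parquet files.

```python
def partition_key(record, partition_fields):
    """Build the partition key of a record from its partition field values."""
    return tuple(record[f] for f in partition_fields)


def read_related_master(master_by_partition, change_records, partition_fields):
    """Return only the master partitions touched by at least one change record."""
    touched = {partition_key(r, partition_fields) for r in change_records}
    return {k: rows for k, rows in master_by_partition.items() if k in touched}


# Hypothetical master dimension, partitioned by "region" into three partitions.
master_by_partition = {
    ("east",): [{"id": 1, "region": "east", "name": "Ann"}],
    ("west",): [{"id": 2, "region": "west", "name": "Bo"}],
    ("north",): [{"id": 3, "region": "north", "name": "Cy"}],
}

# Five change records that fall into two partitions.
change_records = [
    {"id": 1, "region": "east", "name": "Anne"},
    {"id": 4, "region": "east", "name": "Dee"},
    {"id": 5, "region": "east", "name": "Eli"},
    {"id": 2, "region": "west", "name": "Bob"},
    {"id": 6, "region": "west", "name": "Fay"},
]

related = read_related_master(master_by_partition, change_records, ["region"])
print(sorted(related))  # [('east',), ('west',)]
```

The `north` partition is never read, which is the saving that the partitioned configuration aims for. In plain Spark, dynamic partition overwrite is controlled by the `spark.sql.sources.partitionOverwriteMode` property. Whether that is the exact setting the prerequisite refers to is an assumption here.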
- Origins
- Configure a Whole Directory origin as the master origin to read the master
dimension data. Configure a change origin to read the change data. Then,
connect both origins to the Slowly Changing Dimension processor.
- When connected to the processor, each time the change origin reads a batch
of data, the master origin reads the dimension data.
- Unlike most origins, the Whole Directory origin does not cache data or track
offsets, so it can read all of the master dimension data each time the
change origin reads a batch of data. This ensures that comparisons are made
against the latest dimension data.
- Whether the master origin reads the entire dimension or only the related master
records depends on whether ludicrous mode is enabled at the pipeline
level.
- Processor
- When both sets of data pass to the Slowly Changing Dimension processor, the
processor compares change records with master records, then passes records
flagged for insert or update downstream.
- Configure the following properties in the processor:
- Ensure that the master origin is connected to the master data input
and the change origin to the change data input for the processor.
If they are connected to the wrong locations, you can easily
reverse the connections by clicking the Change Input Order link
on the General tab of the processor.
- To determine how records are evaluated for insert or update, specify
the SCD Type and related properties.
- List the key fields used to match change fields with master fields.
Note: When processing partitioned dimension files, list the
partition fields after the key fields.
- For Type 2 dimensions, specify one or more tracking field names and
types. For Type 1 dimensions, this is optional.
- Enable the Output Full Master Data property so the master data is
passed to the destination along with the change data. The
destination can then determine whether to write the entire
master data set or a subset of the master data to the
dimension.
- Optionally configure other properties, such as whether to replace
null values with data from the latest master record and the action
to take when change data includes additional fields.
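To make the comparison step more concrete, here is a minimal sketch of a Type 2 style comparison in plain Python. It is an illustration under stated assumptions, not the processor's actual logic or output format: the function name, the `version` tracking field, and the `scd_flag` values are hypothetical. The sketch matches change records to master records on the key fields, assigns each change record the next version number, and passes the master records through unchanged, as happens when full master data is output.

```python
def flag_type2_changes(master, changes, key_fields, version_field="version"):
    """Flag change records for insert with an incremented version number.

    Master records pass through unchanged so that the destination
    receives the full dimension along with the new records.
    """
    def key(rec):
        return tuple(rec[f] for f in key_fields)

    # Track the highest version seen so far for each key.
    latest = {}
    for rec in master:
        k = key(rec)
        if k not in latest or rec[version_field] > latest[k]:
            latest[k] = rec[version_field]

    output = [dict(rec, scd_flag="master") for rec in master]
    for chg in changes:
        k = key(chg)
        new_rec = dict(chg)
        # A key with no master record starts at version 1.
        new_rec[version_field] = latest.get(k, 0) + 1
        new_rec["scd_flag"] = "insert"
        latest[k] = new_rec[version_field]
        output.append(new_rec)
    return output


# Hypothetical sample data with "id" as the only key field.
master = [{"id": 1, "city": "Oslo", "version": 1}]
changes = [{"id": 1, "city": "Bergen"}, {"id": 2, "city": "Rome"}]

result = flag_type2_changes(master, changes, ["id"])
for row in result:
    print(row)
```

The sketch covers only a version-number tracking field. Other tracking field types, null replacement, and the handling of additional fields in change data are left out.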
- Destination
- Configure a dimension destination to write to the master dimension.
- For a partitioned file dimension, configure the following properties:
- On the primary tab of the destination, such as the File tab for the File
destination, select the Exclude Unrelated SCD Master Records property.
This filters out master records that are not related to the
change records.
- On the same tab, set the Write Mode property to Overwrite Related
Partitions.
With this configuration, the destination overwrites only partitions
related to the change records, leaving unchanged partitions as is.
- For an unpartitioned file dimension:
- On the primary tab of the destination, such as the File tab for the File
destination, set the Write Mode property to Overwrite Files.
- Do not enable the Exclude Unrelated SCD Master Records property.
With this configuration, the destination deletes all existing master
dimension files and writes a new master dimension file that contains all
master data with the latest changes.
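The difference between the two write modes can be sketched in plain Python as follows. This is a simplified model, not the destination's implementation: the stored dimension is a dictionary instead of files, and the function names, the `region` partition field, and the `"dimension"` file key are hypothetical.

```python
def overwrite_related_partitions(stored, output_rows, partition_fields):
    """Replace only the partitions present in output_rows; keep the rest as is."""
    new_parts = {}
    for row in output_rows:
        part = tuple(row[f] for f in partition_fields)
        new_parts.setdefault(part, []).append(row)
    result = dict(stored)     # partitions without changes are left untouched
    result.update(new_parts)  # partitions with changes are rewritten in full
    return result


def overwrite_files(stored, output_rows):
    """Discard all existing data and write output_rows as the whole dimension."""
    return {"dimension": list(output_rows)}


# Hypothetical stored dimension, partitioned by "region".
stored = {
    ("east",): [{"id": 1, "region": "east", "version": 1}],
    ("north",): [{"id": 3, "region": "north", "version": 1}],
}

# Rows reaching the destination: related master rows plus the new version.
output_rows = [
    {"id": 1, "region": "east", "version": 1},
    {"id": 1, "region": "east", "version": 2},
]

partitioned_result = overwrite_related_partitions(stored, output_rows, ["region"])
print(len(partitioned_result[("east",)]))  # 2
```

In the partitioned case, the rewritten partition must contain its existing master rows as well as the new ones, which is why master data travels through the pipeline with the change data. In the unpartitioned case, `overwrite_files` keeps only what it receives, so the full master data must reach the destination and unrelated master records must not be excluded.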