Slowly Changing Dimension
The Slowly Changing Dimension processor generates updates to a Type 1 or Type 2 slowly changing dimension by evaluating change data against master dimension data. Use the processor in a pipeline to update slowly changing dimension data stored in a database table or a set of files.
When you configure a Slowly Changing Dimension processor, you specify the slowly changing dimension type and the key fields used to match change and master data.
You can specify the action to take when a change record includes fields not included in the master records. You can replace null values in change data with corresponding master data. You can also include all master data in the processor output so the destination can overwrite file dimensions as needed.
You can specify one or more tracking fields and configure the tracking type for the fields. You can also specify a subset of fields to trigger a Type 2 change.
When you configure a slowly changing dimension pipeline, it's important to configure the pipeline properties and all stage properties appropriately for your use case. For more information, see Pipeline Configuration.
Slowly Changing Dimension Pipeline
A slowly changing dimension pipeline compares change data against master dimension data, then writes the changes to the master dimension data.
A slowly changing dimension pipeline can process a traditional table dimension, where the dimension data is stored in a database table. It can also process a file dimension, where the dimension data is stored in a set of files in a directory.
The simplest slowly changing dimension pipeline looks like this:
- Master origin - Reads the master dimension data. Use one of the following origins:
- Whole Directory - Use to read a file dimension. The dimension files must reside within a single directory, but can include partitions. The directory must not contain any non-dimension files.
- JDBC Table origin - Use to read a table dimension.
- Change origin - Reads change data. Change data can be read by any Transformer origin.
- Slowly Changing Dimension processor - Compares change data against master data and flags change records for insert or update.
- Dimension destination - Writes results to the master dimension. Use one of the following destinations:
- ADLS Gen1 - Use to write to a file dimension on Azure Data Lake Storage Gen1.
- ADLS Gen2 - Use to write to a file dimension on Azure Data Lake Storage Gen2.
- Amazon S3 - Use to write to a file dimension on Amazon S3.
- File - Use to write to a file dimension on HDFS or a local file system.
- JDBC - Use to write to a database table dimension.
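The comparison that the Slowly Changing Dimension processor performs can be pictured with a small sketch. This is illustrative Python only, with hypothetical field names; the actual processor runs as part of a Spark application:

```python
# Minimal sketch of SCD-style change flagging (illustrative only,
# not StreamSets code): records whose key exists in the master data
# are flagged for update, all others for insert.

def flag_changes(master, changes, key_fields):
    """Compare change records against master records by key fields."""
    index = {tuple(rec[k] for k in key_fields): rec for rec in master}
    flagged = []
    for rec in changes:
        key = tuple(rec[k] for k in key_fields)
        op = "update" if key in index else "insert"
        flagged.append({**rec, "_op": op})
    return flagged

master = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}]
changes = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cai"}]
result = flag_changes(master, changes, ["id"])
# result[0] is flagged "update", result[1] is flagged "insert"
```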
Pipeline Configuration
You configure a slowly changing dimension pipeline differently, depending on whether it updates a dimension stored in a table or in files.
Table Dimension Pipeline
- Pipeline
- On the General tab of the pipeline properties panel, enable ludicrous mode to read only the master data that is related to the change data, thereby improving pipeline performance.
- Origins
- Configure the master origin, the JDBC Table origin, to read the master dimension data. Configure a change origin to read change data. Then, connect them to the Slowly Changing Dimension processor.
- Processor
- When both sets of data pass to the Slowly Changing Dimension processor, the processor compares change records with master records, then passes records flagged for insert or update downstream.
- Destination
- Configure the JDBC destination to write to the master dimension.
File Dimension Pipeline
For a file dimension, dimension files must be overwritten since updating a record in a file is not possible. To ensure that new dimension files contain master data as well as change data, master data must be passed through the pipeline along with change data.
You can organize a file dimension as a partitioned file dimension or an unpartitioned file dimension. A partitioned file dimension has a set of partition files grouped by field values. An unpartitioned file dimension has all dimension files stored in a single directory. You configure a file dimension pipeline a bit differently for each type of file dimension.
Here's how to configure a file dimension pipeline:
Pipeline
- Partitioned file dimension
- For a partitioned file dimension, configure the following properties:
- Enable Spark to overwrite partitions dynamically. This allows the destination to overwrite only the partition files with changes. For more information, see Partitioned File Dimension Prerequisite.
- On the General tab of the pipeline properties panel, enable ludicrous mode to avoid reading master data that is not related to the change data, thereby improving pipeline performance.
With this configuration, if change data includes five records in two partitions, then the master origin reads only those two partitions of the master data, and the destination can overwrite only the files in those two partitions.
- Unpartitioned file dimension
- For an unpartitioned file dimension, all dimension data must be read and written, so no special pipeline properties are required. Each time a batch of change data is read, the master origin must read the entire dimension. When writing the change, the destination must overwrite the entire dimension.
Origins
Configure the master origin, the Whole Directory origin, to read the master dimension data. Configure a change origin to read the change data. Then, connect them to the Slowly Changing Dimension processor.
When connected to the processor, each time the change origin reads a batch of data, the master origin reads the dimension data.
Unlike most origins, the Whole Directory origin does not cache data or track offsets, so it can read all of the master dimension data each time the change origin reads a batch of data. This ensures that comparisons are made against the latest dimension data.
Whether the master origin reads the entire dimension or just related master records depends on whether ludicrous mode is enabled at a pipeline level.
Processor
When both sets of data pass to the Slowly Changing Dimension processor, the processor compares change records with master records, then passes records flagged for insert or update downstream.
- Ensure that the master origin is connected to the master data input and the change origin to the change data input for the processor. If they are connected to the wrong locations, you can easily reverse the connections by clicking the Change Input Order link on the General tab of the processor.
- To determine how records are evaluated for insert or update, specify the SCD Type and related properties.
- List the key fields used to match change records with master records. Note: When processing partitioned dimension files, list the partition fields after the key fields.
- For Type 2 dimensions, specify one or more tracking field names and types. For Type 1 dimensions, this is optional.
- Enable the Output Full Master Data property so the master data is passed to the destination along with the change data. The destination can then determine whether to write the entire master data set or a subset of the master data to the dimension.
- Optionally configure other properties, such as whether to replace null values with data from the latest master record and the action to take when change data includes additional fields.
Destination
- Partitioned file dimension
-
For a partitioned file dimension, configure the following properties:
- Select the Exclude Unrelated SCD Master Records property to filter out master records that are not related to the change records.
- Set the Write Mode property to Overwrite Related Partitions to overwrite only partition files in groups with changes.
- In the Partition by Fields property, enter the same fields listed in the Key Fields property of the Slowly Changing Dimension processor, so that the dimension files are grouped by those fields.
With this configuration, the destination overwrites only partition files related to the change records, leaving unchanged partition files as is.
- Unpartitioned file dimension
-
For an unpartitioned file dimension, configure the following properties:
- Set the Write Mode property to Overwrite Files.
- Do not enable the Exclude Unrelated SCD Master Records property.
With this configuration, the destination deletes all existing master dimension files and writes a new master dimension file that contains all master data with the latest changes.
Partitioned File Dimension Prerequisite
To write to a partitioned file dimension, configure Spark to overwrite partitions dynamically.
Set the Spark configuration property spark.sql.sources.partitionOverwriteMode to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. For example, you might set the property to dynamic in Spark when you plan to enable the Overwrite Related Partitions mode in most of your pipelines, and then set the property to static in individual pipelines that do not use that mode.
To configure the property in an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
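For example, in a standalone Spark deployment, the cluster-wide setting could be placed in spark-defaults.conf. This is a sketch; your deployment's configuration mechanism may differ:

```
spark.sql.sources.partitionOverwriteMode  dynamic
```

In an individual pipeline, the equivalent is an extra Spark configuration property with this same name and value on the Cluster tab.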
Change Processing
- SCD Type - Type of slowly changing dimension, Type 1 or Type 2. For more information, see Type 1 and Type 2 Change Evaluation.
- Key Fields - One or more fields used to determine if a matching record exists in the master dimension data.
- Extra Field Mode - Action to take when a change record includes fields that do not exist in the corresponding master record:
- Drop - Drops extra fields from the record.
- Error - Generates an error, which stops the pipeline.
Type 1 and Type 2 Change Evaluation
- Type 1
- A Type 1 slowly changing dimension keeps only a single version of a record.
- Type 2
- A Type 2 slowly changing dimension has an active record - the most recent version of a record - and keeps previous versions of the record for historical reference.
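The difference between the two types can be sketched in plain Python. This is an illustrative simplification with hypothetical field names, not StreamSets code:

```python
# Type 1 vs. Type 2 handling of the same change (illustrative only).

def apply_type1(dimension, change, key):
    """Type 1: overwrite the matching record; no history is kept."""
    return [change if rec[key] == change[key] else rec for rec in dimension]

def apply_type2(dimension, change, key):
    """Type 2: retire the active version and append the change as
    the new active record, keeping previous versions for history."""
    out = []
    for rec in dimension:
        if rec[key] == change[key] and rec["active"]:
            out.append({**rec, "active": False})  # retire old version
        else:
            out.append(rec)
    out.append({**change, "active": True})  # new active version
    return out

dim = [{"id": 1, "city": "Oslo", "active": True}]
change = {"id": 1, "city": "Bergen"}
t1 = apply_type1(dim, {**change, "active": True}, "id")  # 1 record
t2 = apply_type2(dim, change, "id")                      # 2 records
```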
Processor Output
By default, the Slowly Changing Dimension processor outputs change records that are flagged for insert or update.
The following properties can alter how records are processed and passed:
- Tracking Fields
- The processor adds the specified tracking fields to change records or updates existing tracking fields with the appropriate values.
- Output Full Master Data
- The processor includes all existing master data in the output. Use this option when writing to a slowly changing dimension that must be overwritten instead of updated, such as a file dimension, to ensure that the master dimension data is written along with the latest changes.
- Replace Nulls
- For Type 2 update records, the processor replaces null or missing values in the change data with the values from the latest version of the record in the master dimension data.
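The Replace Nulls behavior amounts to filling gaps in a change record from the latest master version. A minimal sketch, with hypothetical field names:

```python
# Sketch of null replacement for a Type 2 update record
# (illustrative only; field names are hypothetical).

def replace_nulls(change_rec, latest_master_rec):
    """Fill null or missing change fields from the latest master version."""
    filled = dict(latest_master_rec)  # start from the master values
    # overwrite with every non-null value from the change record
    filled.update({k: v for k, v in change_rec.items() if v is not None})
    return filled

master = {"id": 7, "name": "Dana", "email": "dana@example.com"}
change = {"id": 7, "name": "Dana Q.", "email": None}
merged = replace_nulls(change, master)
# merged keeps the master email and takes the changed name
```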
Tracking Fields
A tracking field is a field in a Type 2 slowly changing dimension that indicates which record is the most recent, or active, record. Type 2 dimensions can use several types of tracking fields.
Type 1 dimensions simply replace the existing record, so no earlier versions are retained. However, a Type 1 dimension can use a timestamp tracking field, such as a last-updated field, to indicate when the record was updated.
When you configure the Slowly Changing Dimension processor, you specify the tracking field name and type. You can specify as many tracking fields as appropriate.
- Version Increment
- The processor increments each version of the record and places the latest version number in a user-defined field.
- Active Flag
- The processor uses a boolean active field. The most recent version is flagged as true, and older versions are flagged as false.
- As Of / Start Timestamp
- The processor places the datetime in a user-defined field each time it writes a new version. The record with the most recent datetime is the most recent record.
- End Timestamp
- The processor uses a user-defined field to indicate when a record version is no longer used. The record with no end timestamp is the most recent record.
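To make the four tracking-field types concrete, here is what a retired version and its active replacement might look like in a Type 2 dimension. Field names are hypothetical, chosen only to illustrate each tracking type:

```python
# Hypothetical Type 2 records showing all four tracking-field types
# side by side (illustrative only; field names are made up).

retire_time = "2024-01-15T00:00:00+00:00"

old_version = {
    "id": 1, "city": "Oslo",
    "version": 1,                              # Version Increment
    "active": False,                           # Active Flag: retired
    "start_ts": "2023-06-01T00:00:00+00:00",   # As Of / Start Timestamp
    "end_ts": retire_time,                     # End Timestamp: set on retire
}
new_version = {
    "id": 1, "city": "Bergen",
    "version": 2,                              # incremented
    "active": True,                            # Active Flag: current
    "start_ts": retire_time,                   # starts when old version ends
    "end_ts": None,                            # no end timestamp => current
}
```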
Configuring a Slowly Changing Dimension Processor
Configure a Slowly Changing Dimension processor as part of a slowly changing dimension pipeline that updates a slowly changing dimension.
Before configuring a slowly changing dimension pipeline, consider the processing that you want to achieve. When writing to a partitioned file dimension, complete the Spark prerequisite.
For information about configuring a slowly changing dimension pipeline, see Pipeline Configuration.