Slowly Changing Dimension

The Slowly Changing Dimension processor generates updates to a Type 1 or Type 2 slowly changing dimension by evaluating change data against existing dimension data. Use the processor in a pipeline to update slowly changing dimension data stored in a Snowflake table.

When you configure a Slowly Changing Dimension processor, you specify the key columns used to match change and dimension data and the slowly changing dimension type. You can also specify the action to take when change rows have columns that are not included in the dimension table.

When working with a Type 2 dimension, you specify how the dimension tracks changes: active flag column, start and end timestamp columns, or version column. Then, you define the tracking columns to use. When using start and end timestamp columns, you can configure the processor to use a timestamp in the change rows instead of the processing time for start and end timestamps.

When you configure a slowly changing dimension pipeline, it's important to configure the origins and destinations appropriately. For more information, see Slowly Changing Dimension Pipeline.

Slowly Changing Dimension Pipeline

A slowly changing dimension pipeline compares change data against the data in a slowly changing dimension table, then writes the changes to the dimension, making all necessary updates to existing dimension data.

The simplest slowly changing dimension pipeline looks like this:

A slowly changing dimension pipeline includes the following components:
  1. Dimension origin - Reads data from the slowly changing dimension table. The data from this origin must flow into the first input for the Slowly Changing Dimension processor.
  2. Change origin - Reads data from a change data table. The data from this origin must flow into the second input for the Slowly Changing Dimension processor.
  3. Slowly Changing Dimension processor - Compares change data against master data. Flags change rows for insert or update. Adds tracking columns and performs other processing on change rows as needed.
  4. Dimension destination - Writes change data to the slowly changing dimension table. Updates existing dimension data as needed.
When needed, you can add processors to the pipeline. For example, if the change data resides in two different tables, you can use a Join processor to merge the data and a Sort processor to ensure that the data is in the appropriate order.
Important: Typically, you do not enable the Cache Data property in a slowly changing dimension pipeline. Enabling the Cache Data property in pipeline stages can produce unexpected results.

Data Processing

The Slowly Changing Dimension processor supports Type 1 and Type 2 slowly changing dimensions. The processor handles rows differently based on the selected slowly changing dimension type.

Type 1 Processing

A Type 1 slowly changing dimension keeps only the current dimension data. The dimension does not retain previous versions of the data.

When the values of specified key columns in a change row do not exist in the dimension table, the processor flags the change row for insert. When written to the dimension, the change row becomes a new row in the table.

When key column values in a change row match data in the dimension table, the processor flags the change row for update. When written to the dimension, the change row replaces the existing row.

For example, if the dimension table includes a row with a key column value of 005, then change row with a key column value of 005 replaces the existing row.

Type 2 Processing

A Type 2 slowly changing dimension includes the most recent version of dimension data and keeps previous versions of the data for historical reference.
Important: The change data for Type 2 dimensions must not contain columns with the same names as the tracking columns specified in the processor.

As with Type 1 dimensions, when the key columns in a change row do not match data in the dimension table, the processor flags the change row for insert. When written to the dimension, the change row becomes a new row in the table.

When the key columns in a change row match data in the dimension table, the processor handles the row based on the specified tracking columns.

Version column
The dimension table includes a version column that you define in the processor. When key column values in a change row match data in the dimension table, the processor adds the version column to the change row and increments the existing version by one. No changes to existing dimension data are needed.
For example, say the dimension includes a row with an Id key column value of 005 and a Version column set to 2. When a change row includes 005 in the Id column, the processor adds a new Version column to the change row with a value of 3. The existing row remains unchanged.
Active flag column
The dimension table includes an active flag column that you define in the processor. You also specify the values used in the active column, such as Y/N, true/false, or 0/1. When needed, you can define custom values for the active flag column.
When key column values in a change row match data in the dimension table, the processor adds the active column to the change row and sets the value to the equivalent of true. The processor also updates the matching row in the dimension table, setting the active column to the equivalent of false.
For example, say the dimension includes a row with an Id key column value of 005 and an Active column of true. When a change row includes 005 in the Id column, the processor adds a new Active column to the change row with a value of true. The processor also updates the existing 005 row, changing the Active column value from true to false.
Timestamp columns
The dimension table includes start and end timestamp columns that you define in the processor. A row is considered active when it does not include an end time.
When key column values in a change row match data in the dimension table, the processor adds the start and end timestamp columns to the change row. It uses the processing time for the start timestamp column and leaves the end timestamp column empty. The processor also updates the matching row in the dimension table, adding the processing time to the end timestamp column.
For example, say the dimension includes a row with an Id key column value of 005, a startTime value, and no endTime value, which indicates that the row is active. When a change row includes 005 in the Id column, the processor adds the startTime and endTime columns to the change row and sets the startTime to the processing time. It also updates the existing 005 row, adding the processing time to the previously empty endTime column.
You can configure the processor to use a timestamp in the change data for the start and end times for the dimension. Simply enable the Specify Column for New Timestamps property, then in the Calculated Timestamp Column property, specify the timestamp column name in the change data. When using a timestamp in the change data for the start and end times, the processor removes the specified timestamp column from the change data.
For example, say the change data in the previous example includes a time timestamp column that indicates when the change occurred, and you want to use that timestamp for the start and end timestamps for the dimension. In the processor, you enable specifying the column, then set the Calculated Timestamp Column property to time. When the pipeline runs, the processor uses the timestamp in the time column instead of the processing time for both the startTime column in the change row and the endTime column in the existing 005 row. The processor also removes the time column from the change rows.

Pipeline Configuration Guidelines

Use the following guidelines when configuring a slowly changing dimension pipeline:
Dimension and change data origins
Configure the dimension origin and change data origin using the following guidelines:
  1. Ensure that the change data origin passes rows in the appropriate order. The Slowly Changing Dimension processor handles change data in the order that it is received.

    When necessary, you can use processors such as the Sort processor to order change data before passing it to the Slowly Changing Dimension processor.

  2. For Type 2 dimensions, ensure that the change data does not contain columns with the same names as the tracking columns specified in the processor.
  3. Typically, you do not enable the Cache Data property in the dimension or change data origin. This ensures that the pipeline reads the latest updates in both the dimension table and the change data table.

    Similarly, avoid caching data in all processors that you add to the pipeline.

Slowly Changing Dimension processor
Configure the processor using the following guidelines:
  1. Link the dimension origin to the processor, then link the change data origin. The first origin connects to the master data input, and the second connects to the change data input.
    Important: The dimension origin must be connected to the master data input. If the origins are connected incorrectly, select the processor in the pipeline canvas. Then in the pop-up menu, select the Reorder icon: .

    For example, in the following pipeline, you can use the Reorder icon to correct the origin inputs:

  2. On the Dimension tab, configure the SCD Type property.
  3. In the Key Columns property, list one or more key columns to use when comparing change rows with master rows.

    For example, if the dimension uses a storeId column as a primary key, add the column name to the Key Columns property.

  4. For Type 2 dimensions, specify one or more of the following tracking columns: version, active flag, or timestamp.
  5. For the Behavior for New Columns property, specify the action to take when change data includes additional columns:
    • Keep new columns - Adds new columns to the dimension table so data in new columns can be included in dimension data. Uses null values for rows and rows that do not include the new columns.
    • Remove from change data - Removes columns from the change data that do not already exist in the dimension table.
Dimension destination
Configure the slowly changing dimension destination using the following guidelines:
  1. Specify the name of the slowly changing dimension table.
  2. Set Write Mode to Propagate updates from Slowly Changing Dimension processor.
    Important: To process data correctly, the destination must be configured to use this write mode.
  3. Typically the dimension table already exists, but if it does not, you can select Create Table to have the destination create the table.

    When enabled, the destination creates the table based on the structure of rows in the first batch of data.

  4. To allow the pipeline to add new columns to the dimension table, select Data Drift Enabled. If you do not want the table structure to change, disable the property.

    If the Slowly Changing Dimension processor has the Behavior for New Columns property set to Keep new columns, then enable data drift handling so the destination can add new columns to the dimension table.

Example

Say you want to update a Type 2 slowly changing dimension that uses start and end timestamp columns to indicate which rows are active.

Source Data

The store dimension includes the following rows:
storeid city state start_date end_date
CA-0001 San Francisco CA 1996-03-09 00:00:00.000 null
CA-0002 Los Angeles CA 2010-07-08 00:00:00.000 null
MD-0001 Rockville MD 2000-11-10 00:00:00.000 null
MD-0001 Bethesda MD 1995-01-01 00:00:00.000 2000-11-10 00:00:00.000

Null values in the end_date column indicate that the first three rows are active stores. The end_date value for the last row indicates that the store in Bethesda closed in 2000. The time portion of the timestamps are set to 00:00:00.000 because the time was not originally included in the data.

The change data table includes the following rows:
storeid city state
CA-0001 Colma CA
MD-0001 Germantown MD
GA-0001 Atlanta GA

This indicates that CA-0001 moved from San Francisco to Colma, MD-0001 moved from Rockville to Germantown, and a new store opened in Atlanta.

Pipeline Configuration

You configure the pipeline as described in Pipeline Configuration Guidelines:
  1. Create an origin for the dimension data and an origin for the change data, and disable caching for both origins.

    Then, link both origins to the Slowly Changing Dimension processor, starting with the dimension origin so it connects to the master data input.

    Important: The dimension origin must be connected to the master data input. If the origins are connected incorrectly, select the processor in the pipeline canvas. Then in the pop-up menu, select the Reorder icon: .
  2. Configure the Slowly Changing Dimension processor to perform a Type 2 dimension update using the start_date and end_date timestamp columns, as follows:

  3. Configure the destination to write to the dimension table and to use the Propagate updates from Slowly Changing Dimension processor write mode.

Pipeline Processing

When you run the pipeline, the Slowly Changing Dimension processor compares the dimension data with the change data. It adds the start_date and end_date columns to the change data, and populates the start_date column with the time of processing.

The destination writes the three change rows to the dimension table and adds the time of processing to the end_date columns for the closed San Francisco and Rockville stores. The updated dimension table contains the following data:
storeid city state start_date end_date
CA-0001 Colma CA

2022-12-16 15:03:28.527

null
GA-0001 Atlanta GA

2022-12-16 15:03:28.527

null
MD-0001 Germantown MD

2022-12-16 15:03:28.527

null
CA-0002 Los Angeles CA 2010-07-08 00:00:00.000 null
CA-0001 San Francisco CA 1996-03-09 00:00:00.000

2022-12-16 15:03:28.527

MD-0001 Rockville MD 2000-11-10 00:00:00.000

2022-12-16 15:03:28.527

MD-0001 Bethesda MD 1995-01-01 00:00:00.000 2000-11-10 00:00:00.000

Configuring a Slowly Changing Dimension Processor

Configure a Slowly Changing Dimension processor as part of a slowly changing dimension pipeline that updates a Type 1 or Type 2 slowly changing dimension. Before you configure a slowly changing dimension pipeline, consider the processing that you want to achieve.

  1. On the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches processed data.
  2. On the Dimension tab, configure the following properties:
    Dimension Property Description
    SCD Type Type of slowly changing dimension to work with:
    • Type 1 - The dimension contains only active rows.
    • Type 2 - The dimension contains active rows and earlier versions of those rows for historical reference.
    Key Columns One or more columns in change rows used to determine if a matching row exists in the dimension table. You can enter column names or select columns from preview data.
    Specify Version Column Enables specifying a version column.
    Version Column Name of the version column in the dimension data. You can enter the column name or select a column from preview data.
    Specify Active Flag Enables specifying an active flag column.
    Active Flag Column Name of the active flag column in the dimension data. You can enter the column name or select a column from preview data.
    Specify Timestamp Columns Enables specifying start and end timestamp columns.
    Start Timestamp Column Name of the start timestamp column in the dimension data. You can enter the column name or select a column from preview data.
    End Timestamp Column Name of the end timestamp column in the dimension data. You can enter the column name or select a column from preview data.
    Specify Column for New Timestamps Enables specifying a timestamp column in change data to use as the start and end timestamp values, instead of the pipeline processing time.

    The specified timestamp column is removed from the change data.

    Calculated Timestamp Column Timestamp column in change data to use. This column is removed from change data after rows are evaluated. You can enter the column name or select a column from preview data.
    Behavior for New Columns Action to take when change data includes columns that do not exist in the dimension table:
    • Keep for newest rows and set to null for previous rows
    • Remove from change data
    Tip: When keeping new columns, enable data drift in the pipeline destination to allow it to update the dimension table.