Surrogate Key Generator

The Surrogate Key Generator processor generates a unique surrogate key for each record. Each generated key value is a 64-bit integer.

The processor is guaranteed to generate unique keys within a pipeline run, as long as the pipeline is processed in fewer than 1 billion partitions and each partition has fewer than 8 billion records. The processor might generate duplicate keys on subsequent runs, depending on how you configure the processor and on the volume of data being processed.

Use the processor to generate a unique ID for each record. For example, if a pipeline processes order data that doesn't include a primary key, you can add the Surrogate Key Generator processor to create a surrogate key for each record of order data.

When pipeline data is split into partitions, the Surrogate Key Generator processor uses a different range of key values for each partition to ensure that generated keys are unique. The processor generates keys for the first partition using the starting value. For each additional partition, the processor adds 2 to the power of 33, or 8589934592, to the starting value of the previous partition.

Within each partition, the processor increments the key value by 1 for each record.
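The partition arithmetic can be sketched in a few lines of Python. This is only an illustration of the rules stated in this topic, not the processor's implementation. The names `PARTITION_STRIDE`, `partition_key_range`, and `ranges_are_disjoint` are hypothetical, and treating keys as signed 64-bit integers is an assumption.

```python
# Sketch only: models the key ranges described in this topic.
PARTITION_STRIDE = 2 ** 33      # 8589934592, the per-partition increment
MAX_SIGNED_64 = 2 ** 63 - 1     # assumption: keys are signed 64-bit integers


def partition_key_range(starting_value, partition_index, record_count):
    """Return (first_key, last_key) for a zero-based partition index."""
    partition_start = starting_value + partition_index * PARTITION_STRIDE
    # The first key is one more than the partition's starting value.
    return partition_start + 1, partition_start + record_count


def ranges_are_disjoint(starting_value, record_counts):
    """Check that each partition's last key stays below the next partition's first key."""
    ranges = [
        partition_key_range(starting_value, index, count)
        for index, count in enumerate(record_counts)
    ]
    return all(prev[1] < nxt[0] for prev, nxt in zip(ranges, ranges[1:]))


# Number of whole strides that fit in a signed 64-bit integer.
strides_in_64_bits = MAX_SIGNED_64 // PARTITION_STRIDE
```

Under these assumptions, `strides_in_64_bits` evaluates to 1073741823, which is consistent with the limit of roughly 1 billion partitions, and the stride of 8589934592 is consistent with the limit of roughly 8 billion records per partition.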

For example, suppose that Spark processes a pipeline in three partitions, each partition contains three records, and the processor uses the default starting value of 1. The Surrogate Key Generator processor generates the following unique key values for each partition:
  • Partition1 - 2, 3, 4
  • Partition2 - 8589934594, 8589934595, 8589934596
  • Partition3 - 17179869186, 17179869187, 17179869188
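The key values in this example can be reproduced with a short Python sketch. It assumes the default starting value of 1 and follows the rules described above; the function name `generate_keys` is hypothetical and is not part of Transformer.

```python
PARTITION_STRIDE = 2 ** 33  # 8589934592


def generate_keys(starting_value, records_per_partition):
    """Return one list of keys per partition, following the rules above."""
    keys_by_partition = []
    for partition_index, record_count in enumerate(records_per_partition):
        # Each additional partition starts 2**33 above the previous one.
        partition_start = starting_value + partition_index * PARTITION_STRIDE
        # Within a partition, the key increases by 1 for each record.
        keys_by_partition.append(
            [partition_start + offset for offset in range(1, record_count + 1)]
        )
    return keys_by_partition


example = generate_keys(starting_value=1, records_per_partition=[3, 3, 3])
for number, keys in enumerate(example, start=1):
    print(f"Partition{number} - {', '.join(str(key) for key in keys)}")
```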

Transformer tracks an offset for the Surrogate Key Generator processor as the pipeline runs. When the pipeline stops, the processor saves the largest key generated among all partitions as the last-saved offset. When you start the pipeline again, the processor continues generating keys larger than the last-saved offset, as long as the offset does not exceed the configured maximum value.
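As a rough illustration of the offset described above, the following Python sketch takes the largest key across all partitions of the three-partition example. The function name `last_saved_offset` is hypothetical, and the sketch assumes that the run processed at least one record.

```python
def last_saved_offset(keys_by_partition):
    """Return the largest key generated across all partitions."""
    # Assumes at least one record was processed during the run.
    return max(key for keys in keys_by_partition for key in keys)


first_run_keys = [
    [2, 3, 4],
    [8589934594, 8589934595, 8589934596],
    [17179869186, 17179869187, 17179869188],
]

offset = last_saved_offset(first_run_keys)

# On the next start, keys continue above the offset, provided that the
# offset is below the configured maximum value.
next_first_key = offset + 1
```

Here `offset` is 17179869188, the last key of the third partition, so the next run would begin at 17179869189.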

When you configure the Surrogate Key Generator processor, you specify the output field to pass the generated key value to.

By default, the processor generates 2 as the first key value. To change the starting value, you can configure the initial, minimum, and maximum values that the processor uses.

Starting Key Value

The Surrogate Key Generator processor compares several values to determine the key value to start with, based on how the pipeline starts:
First pipeline start or subsequent pipeline start with reset offsets
    When a pipeline first starts or when it starts with reset offsets, the processor uses the configured initial value as the starting key value unless the initial value equals or exceeds the configured maximum value. In that case, the processor uses the configured minimum value as the starting key value.
    By default, the initial value is the same as the maximum value, and the minimum value is 1. As a result, when a pipeline first starts or when it starts with reset offsets, the processor uses 1 as the starting key value.
Subsequent pipeline start
    When a pipeline starts after completing an initial run, the processor uses the last-saved offset as the starting key value unless the offset equals or exceeds the configured maximum value. In that case, the processor uses the configured minimum value as the starting key value.

The processor increments the starting value by 1 to generate the first key value. For example, if the starting value is 1, the processor generates 2 as the first key value.
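The selection rules above can be summarized in a small Python sketch. The function names and the use of `None` to represent a first start or a start with reset offsets are assumptions made for illustration; they do not correspond to any Transformer API.

```python
def choose_starting_value(initial, minimum, maximum, last_saved_offset=None):
    """Pick the starting key value when a pipeline starts.

    last_saved_offset is None on a first start or a start with reset offsets.
    """
    candidate = initial if last_saved_offset is None else last_saved_offset
    # A candidate that equals or exceeds the maximum falls back to the minimum.
    if candidate >= maximum:
        return minimum
    return candidate


def first_key(starting_value):
    """The first generated key is the starting value plus 1."""
    return starting_value + 1


# Default-like configuration: the initial value equals the maximum value,
# so the minimum value of 1 becomes the starting value.
default_like_start = choose_starting_value(initial=10**6, minimum=1, maximum=10**6)
print(first_key(default_like_start))  # 2
```

The value `10**6` stands in for the real default maximum only to keep the sketch readable; any configuration where the initial value equals the maximum behaves the same way.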

The processor compares these values only when the pipeline starts. As a result, keys generated during a single pipeline run can exceed the configured maximum value.

For example, let's say that you set the following values for the processor:
  • Minimum Value = 0
  • Maximum Value = 500000
  • Initial Value = 5

When the pipeline first starts, the processor uses the initial value of 5 as the starting value and generates 6 as the first key value. During the initial run, the pipeline processes 100 records in two partitions. When the pipeline stops, the last-saved offset is 8589934644.

You start the pipeline again. Because the last-saved offset exceeds the configured maximum value of 500000, the processor starts over with the minimum value, using 0 as the starting value. The processor generates 1 as the first key value for the second pipeline run. As the second run continues, it can regenerate key values that the first run already used, such as 6, which results in duplicate keys.
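The following Python sketch replays this scenario. Two details are assumptions that the example does not state: the 100 records are split into partitions of 53 and 47 records, chosen so that the sketch arrives at the same last-saved offset, and the second run processes the same number of records as the first. The function names are hypothetical.

```python
PARTITION_STRIDE = 2 ** 33  # 8589934592


def run_pipeline(starting_value, records_per_partition):
    """Return the set of keys generated in one pipeline run."""
    keys = set()
    for partition_index, record_count in enumerate(records_per_partition):
        partition_start = starting_value + partition_index * PARTITION_STRIDE
        keys.update(range(partition_start + 1, partition_start + record_count + 1))
    return keys


def starting_value_for_restart(last_saved_offset, minimum, maximum):
    """Apply the restart rule: fall back to the minimum at or above the maximum."""
    return minimum if last_saved_offset >= maximum else last_saved_offset


MINIMUM, MAXIMUM, INITIAL = 0, 500000, 5
SPLIT = [53, 47]  # assumption: how the 100 records are divided

first_run = run_pipeline(INITIAL, SPLIT)
offset = max(first_run)  # largest key across all partitions

restart_value = starting_value_for_restart(offset, MINIMUM, MAXIMUM)
second_run = run_pipeline(restart_value, SPLIT)

duplicates = first_run & second_run
print(offset, restart_value, len(duplicates))
```

With these assumed partition sizes, the offset is 8589934644, the second run restarts from 0, and the two runs share 90 key values.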

Important: Use caution when changing the default maximum value. If you set the maximum value to a low number relative to the number of partitions and the number of processed records, the processor can create duplicate keys in later pipeline runs.

Configuring a Surrogate Key Generator Processor

Configure a Surrogate Key Generator processor to generate a unique surrogate key for each record.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
    Cache Data - Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Key Generator tab, configure the following options:
    Key Generator Property - Description
    Output Field - Field to pass the generated key value to.
    Minimum Value - Minimum key value to start with.

    Used when the initial value or the last-saved offset equals or exceeds the maximum value as the pipeline starts.

    Default is 1.

    Maximum Value - Maximum key value to start with.

    If the initial value or the last-saved offset equals or exceeds the maximum value, the processor starts with the minimum value.

    The processor checks the maximum value only when the pipeline starts. As a result, a single pipeline run can exceed the maximum value.

    Important: Use caution when changing the default maximum value. If you set the maximum value to a low number relative to the number of partitions and the number of processed records, the processor can create duplicate keys in later pipeline runs.

    Default is 9223372036854776000.

    Initial Value - Initial key value to start with.

    Used when the pipeline first starts or when the pipeline starts with reset offsets and the initial value does not equal or exceed the maximum value.

    Default is the same as the maximum value, which causes the processor to use the minimum value as the starting key value.