Offset Handling

Transformer uses offsets to track the progress of processing. Transformer tracks offsets for most origins and the Surrogate Key Generator processor:

Origin offsets

An origin offset enables Transformer to keep track of the data that has been processed and, thus, where it should continue processing.

Transformer keeps track of the data that has been processed by storing an offset each time a batch of data is read, processed, and written. The offset is saved after Transformer receives confirmation from destination systems that the batch has been written to the system.
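The commit-after-confirmation behavior described above can be illustrated with a small simulation. This is a minimal sketch, not Transformer code: the OffsetStore class, the run_batches function, and the write callback are hypothetical names, and the offset is assumed to be a simple record position.

```python
from typing import Callable, List, Optional


class OffsetStore:
    """Hypothetical in-memory stand-in for wherever offsets are kept."""

    def __init__(self) -> None:
        self.offset: Optional[int] = None

    def save(self, offset: int) -> None:
        self.offset = offset


def run_batches(records: List[str], batch_size: int,
                write: Callable[[List[str]], bool],
                store: OffsetStore) -> None:
    """Process records in batches, saving the offset only after a confirmed write."""
    start = store.offset or 0             # resume from the last saved offset, if any
    while start < len(records):
        batch = records[start:start + batch_size]
        confirmed = write(batch)          # destination reports success or failure
        if not confirmed:
            break                         # offset stays at the last confirmed batch
        start += len(batch)
        store.save(start)                 # commit only after confirmation
```

Because the offset advances only after the write is confirmed, a batch that fails to write is read again on the next run rather than skipped.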

By default, offsets are stored between pipeline runs. When a pipeline with origins that store offsets comes to a graceful stop, Transformer stores the offsets for the pipeline run. This allows Transformer to resume processing where it left off the next time you start the pipeline. A graceful stop is one where Transformer performs all expected tasks before stopping. Force-stops, and cases such as the Transformer machine shutting down before Transformer is stopped, are not graceful stops.

You can configure any origin that tracks offsets to skip tracking the offset. Also, you can reset all pipeline offsets when you start a pipeline to have Transformer run the pipeline as if for the first time.

Transformer maintains offsets for all origins that can be included in both batch and streaming pipelines. Transformer does not maintain offsets for the following origins that can be included in batch pipelines only:

Delta Lake
Kudu
Whole Directory

Since these origins do not track offsets, they read all available data each time that the pipeline runs.

Processor offset

Transformer tracks an offset for the Surrogate Key Generator processor as the pipeline runs to ensure that the processor does not generate duplicate keys.

By default, Transformer saves the offset between pipeline runs so that when you restart the pipeline, the Surrogate Key Generator processor continues generating keys larger than the last-saved offset.

If you reset pipeline offsets while starting a pipeline, the offset for the Surrogate Key Generator processor is reset as well. As a result, the processor starts key generation with the specified initial value.
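The following sketch models how a saved offset prevents duplicate keys across pipeline runs. It is an illustration only: the SurrogateKeySketch class is a hypothetical name, and the increment of 1 between keys is an assumption made for the example, not a documented detail of the processor.

```python
from typing import List, Optional


class SurrogateKeySketch:
    """Hypothetical model: keys increase from an initial value or a saved offset."""

    def __init__(self, initial_value: int = 1,
                 saved_offset: Optional[int] = None) -> None:
        # With a saved offset, continue above it; otherwise use the initial value.
        self.next_key = initial_value if saved_offset is None else saved_offset + 1
        self.last_key = saved_offset      # the value that would be saved as the offset

    def generate(self, count: int) -> List[int]:
        """Return the next `count` keys (assumed increment of 1)."""
        keys = list(range(self.next_key, self.next_key + count))
        if keys:
            self.last_key = keys[-1]
            self.next_key = self.last_key + 1
        return keys
```

In this model, a first run that generates three keys leaves last_key at 3. Constructing the next run with saved_offset=3 continues at 4, while omitting the saved offset (the equivalent of a reset) starts again at the initial value.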

Skip Offset Tracking

You can configure any origin that tracks offsets to skip tracking offsets. You cannot configure the Surrogate Key Generator processor to skip tracking offsets.

Skip offset tracking when you want an origin to treat every batch as if the pipeline had just started running for the first time. The following examples describe two situations where this is appropriate.

For example, say you want a pipeline to process all data in a Hive table every time you run the pipeline. To get the desired results, you use the Hive origin in a batch pipeline to read all data in a single batch. Then, you enable the Skip Offset Tracking property in the origin to ensure that all data is processed with each pipeline run. If you allow offset tracking, the pipeline reads all available data in the first pipeline run, but in subsequent runs, it reads only the data that arrived since the last pipeline run.
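The difference between the two behaviors in this example can be sketched as follows. This is a simplified model, not the Hive origin's implementation: the read_batch function is hypothetical, and the offset is assumed to be an increasing integer row ID stored in the first element of each row.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (assumed increasing row ID, payload)


def read_batch(rows: List[Row], saved_offset: Optional[int],
               skip_offset_tracking: bool) -> Tuple[List[Row], Optional[int]]:
    """Return the rows to process and the offset to store after the run."""
    if skip_offset_tracking or saved_offset is None:
        selected = rows                               # read everything
    else:
        selected = [r for r in rows if r[0] > saved_offset]  # only newer rows
    if skip_offset_tracking:
        return selected, None                         # nothing is stored
    new_offset = max((r[0] for r in selected), default=saved_offset)
    return selected, new_offset
```

With tracking enabled, a second run over the same table returns only rows with an ID above the stored offset. With skip_offset_tracking set to True, every run returns the full table.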

Skipping offset tracking is critical in a slowly changing dimension streaming pipeline, where you want to compare change data against the latest master dimension data. In this case, you skip offset tracking in the master origin, so the master origin reads the master dimension data every time the pipeline processes data from the change origin. This allows the Slowly Changing Dimension processor to compare changes against the master dimension data. If you don't skip offset tracking, the master origin only reads new master dimension data, providing an incomplete master data set for comparison.

Skipping offset tracking can also produce unwanted results, such as reprocessing the same data in every batch, so skip offset tracking with care.

Note that most streaming pipelines require offset tracking to function as expected. For example, you typically want a Kafka origin to read messages from the specified initial offset, to process all existing messages from that point forward, and to continue processing newly arrived messages. If you skip offset tracking, the origin reprocesses data from the initial offset with each batch.
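The reprocessing effect can be seen in a small simulation of a streaming origin. This sketch does not use a Kafka client: the stream_batches function is hypothetical, the topic is modeled as a plain list, and offsets are assumed to be list positions.

```python
from typing import List


def stream_batches(topic: List[str], initial_offset: int,
                   arrivals: List[List[str]],
                   skip_offset_tracking: bool) -> List[List[str]]:
    """Simulate one batch per arrival step and return what each batch read."""
    position = initial_offset
    batches = []
    for new_messages in arrivals:
        topic = topic + new_messages      # newly arrived messages (caller's list is untouched)
        if skip_offset_tracking:
            position = initial_offset     # no offset kept: start over each batch
        batches.append(topic[position:])  # read from the current position onward
        position = len(topic)             # offset after this batch
    return batches
```

With tracking enabled, each batch in this model contains only messages that arrived since the previous batch. With tracking skipped, each batch starts again at the initial offset, so earlier messages appear in every batch.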

To skip tracking offsets, on the General tab of the origin, select the Skip Offset Tracking property. If the origin does not have the property, it does not track offsets.

Reset Pipeline Offsets

You can optionally reset all pipeline offsets before starting a pipeline. When you reset pipeline offsets, Transformer runs the pipeline as if it were the very first pipeline run.

For example, say you have a batch pipeline that runs weekly. It includes an ADLS Gen2 origin that reads files from a /logs directory. After the pipeline processes all available data, the origin notes the offset - in this case, the last-modified timestamp of the last processed file. Then, the pipeline comes to a stop. The next time you run the pipeline, the pipeline processes only files with a last-modified timestamp after that offset.

Now, say you need to change the destination system that the pipeline writes to, and you want to reprocess all available data to write the results to the new destination system. To do this, you replace the destination in the pipeline. Then, when you start the pipeline, you use the Reset Offsets & Start option.

The pipeline processes all available data in a single batch and stops. As before, it stores the offset. Then on subsequent pipeline runs, it continues processing from the last-saved offset.
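The weekly-run scenario can be modeled with a last-modified timestamp as the offset. This is a sketch under stated assumptions: the select_files function and its reset_offsets parameter are hypothetical, file names and timestamps are made up for illustration, and the real origin may track additional state.

```python
from typing import Dict, List, Optional, Tuple


def select_files(files: Dict[str, float], saved_offset: Optional[float],
                 reset_offsets: bool = False) -> Tuple[List[str], Optional[float]]:
    """Pick files to process and compute the offset to store afterwards.

    `files` maps a file path to its last-modified timestamp.
    """
    if reset_offsets:
        saved_offset = None               # behave like the first pipeline run
    selected = sorted(
        (name for name, mtime in files.items()
         if saved_offset is None or mtime > saved_offset),
        key=lambda name: files[name],     # oldest first
    )
    # The new offset is the timestamp of the last processed file, if any.
    new_offset = files[selected[-1]] if selected else saved_offset
    return selected, new_offset
```

In this model, a normal run selects only files modified after the stored timestamp. A run with reset_offsets=True selects every file and then stores the newest timestamp, so later runs continue from that point.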

To reset pipeline offsets before starting a pipeline, click the menu arrow to the right of the Start button, then click Reset Offsets & Start.