Offset Handling
- Origin offsets: An origin offset lets Transformer keep track of the data that has been processed and, therefore, where processing should continue.
- Processor offset: Transformer tracks an offset for the Surrogate Key Generator processor as the pipeline runs to ensure that the processor does not generate duplicate keys.
Skip Offset Tracking
You can configure any origin that tracks offsets to skip tracking offsets. You cannot configure the Surrogate Key Generator processor to skip tracking offsets.
Skip offset tracking when you want an origin to treat every batch as if the pipeline just started running for the first time. This is appropriate only in specific situations.
For example, say you want a pipeline to process all data in a Hive table every time you run the pipeline. To get the desired results, you use the Hive origin in a batch pipeline to read all data in a single batch. Then, you enable the Skip Offset Tracking property in the origin to ensure that all data is processed with each pipeline run. If you allow offset tracking, the pipeline reads all available data in the first pipeline run, but in subsequent runs, it reads only the data that arrived since the last pipeline run.
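The difference between the two behaviors can be sketched with a small simulation. This is a toy model only: ToyOrigin, its skip_offset_tracking flag, and the row-count offset are hypothetical names chosen for illustration, and they do not represent how Transformer or the Hive origin stores offsets internally.

```python
from dataclasses import dataclass


@dataclass
class ToyOrigin:
    """Hypothetical stand-in for an origin; not Transformer's actual API."""
    skip_offset_tracking: bool = False
    offset: int = 0  # assumed offset format: number of rows already processed

    def read(self, table):
        # With tracking, resume after the last processed row.
        # Without tracking, always start from the first row.
        start = 0 if self.skip_offset_tracking else self.offset
        rows = table[start:]
        if not self.skip_offset_tracking:
            self.offset = len(table)  # remember where to resume on the next run
        return rows


table = ["r1", "r2", "r3"]
tracking = ToyOrigin()
skipping = ToyOrigin(skip_offset_tracking=True)

# First pipeline run: both origins read everything.
first_tracked = tracking.read(table)
first_skipped = skipping.read(table)

table.append("r4")  # new data arrives between runs

# Second pipeline run: the behaviors diverge.
second_tracked = tracking.read(table)  # only the row that arrived since the last run
second_skipped = skipping.read(table)  # the whole table again
```

In this sketch, the origin that skips offset tracking returns the full table on every run, which matches the Hive example above.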
Skipping offset tracking is critical in a slowly changing dimension streaming pipeline, where you want to compare change data against the latest master dimension data. In this case, you skip offset tracking in the master origin, so the master origin reads the master dimension data every time the pipeline processes data from the change origin. This allows the Slowly Changing Dimension processor to compare changes against the master dimension data. If you don't skip offset tracking, the master origin only reads new master dimension data, providing an incomplete master data set for comparison.
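To see why an incomplete master data set is a problem, consider the following simplified sketch. The classify_changes function, the id and city fields, and the sample records are all hypothetical. The sketch does not reproduce the logic of the Slowly Changing Dimension processor. It only shows how a comparison goes wrong when master rows are missing.

```python
def classify_changes(master_rows, change_rows):
    """Label each change record by comparing it with the master rows provided."""
    master_by_key = {row["id"]: row for row in master_rows}
    labels = {}
    for change in change_rows:
        existing = master_by_key.get(change["id"])
        if existing is None:
            labels[change["id"]] = "new"        # key not found in master data
        elif existing["city"] != change["city"]:
            labels[change["id"]] = "changed"    # key found, attribute differs
        else:
            labels[change["id"]] = "unchanged"
    return labels


master = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]
changes = [{"id": 1, "city": "Bergen"}, {"id": 3, "city": "Quito"}]

# Master origin skips offset tracking: the full master data is read for this batch.
full_view = classify_changes(master, changes)

# Master origin tracks offsets: rows read in an earlier batch are not read again,
# so this batch sees no master rows (assuming no new master data has arrived).
partial_view = classify_changes([], changes)
```

With the full master data, record 1 is recognized as a change to an existing row. With the partial view, the same record is mislabeled as new.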
In other pipelines, skipping offset tracking causes the origin to reprocess data unintentionally, so skip offset tracking with care.
Note that most streaming pipelines require offset tracking to function as expected. For example, you typically want a Kafka origin to read messages from the specified initial offset, to process all existing messages from that point forward, and to continue processing newly arrived messages. If you skip offset tracking, the origin reprocesses data from the initial offset with each batch.
To skip tracking offsets, on the General tab of the origin, select the Skip Offset Tracking property. If the origin does not have the property, it does not track offsets.
Reset Pipeline Offsets
You can optionally reset all pipeline offsets before starting a pipeline. When you reset pipeline offsets, Transformer runs the pipeline as if it were the first pipeline run.
For example, say you have a batch pipeline that runs weekly. It includes an ADLS Gen2 origin that reads files from a /logs directory. After the pipeline processes all available data, the origin stores the offset, which in this case is the last-modified timestamp of the last processed file. Then, the pipeline stops. The next time you run the pipeline, the pipeline processes only files with a last-modified timestamp after that offset.
Now, let's say you need to change the destination system that the pipeline writes to, and you want to reprocess all available data to write the results to the new destination system. To do this, you replace the destination in the pipeline. Then, when you start the pipeline, you use the Reset Offsets and Start option.
The pipeline processes all available data in a single batch and stops. As before, it stores the offset. Then on subsequent pipeline runs, it continues processing from the last-saved offset.
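The sequence described above can be sketched as follows. The run_pipeline function, the file names, and the integer timestamps are hypothetical, and passing None as the stored offset stands in for the Reset Offsets and Start option. This is an illustration of the behavior, not Transformer's implementation.

```python
def run_pipeline(files, stored_offset=None):
    """Process files newer than stored_offset; return (processed names, new offset)."""
    pending = sorted(
        (mtime, name)
        for name, mtime in files.items()
        if stored_offset is None or mtime > stored_offset
    )
    processed = [name for _, name in pending]
    # The new offset is the last-modified timestamp of the last processed file.
    # If nothing was processed, the previous offset is kept.
    new_offset = pending[-1][0] if pending else stored_offset
    return processed, new_offset


files = {"/logs/a.log": 100, "/logs/b.log": 200}

run1, offset = run_pipeline(files)          # first run: all available files
files["/logs/c.log"] = 300                  # a new file arrives during the week
run2, offset = run_pipeline(files, offset)  # next run: only the newer file

run3, offset = run_pipeline(files, None)    # offsets reset: all files again
run4, offset = run_pipeline(files, offset)  # later run: continues from saved offset
```

After the reset run, the offset is stored again, so the following run processes nothing until newer files arrive.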
To reset pipeline offsets before starting a pipeline, click the menu arrow to the right of the Start button, then click Reset Offsets & Start.