Caching Data

You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.

By default, Spark regenerates a stage's output separately for each downstream stage that consumes it. When a stage passes data to multiple downstream stages, the stage reprocesses the same data once for each of those stages. To prevent this unnecessary reprocessing, you can configure the stage to cache data.
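In Transformer you enable caching through the stage's configuration rather than in code, but the underlying Spark behavior looks roughly like the following sketch. The paths, column names, and DataFrame operations here are illustrative assumptions, not part of any Transformer pipeline:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical upstream stage: read and enrich order data (paths and columns are made up).
orders = spark.read.parquet("/data/orders")
enriched = orders.withColumn("total", F.col("qty") * F.col("price"))

# Without caching, each downstream branch re-runs the read and enrichment above.
enriched.filter(F.col("total") > 100).write.parquet("/out/large_orders")
enriched.groupBy("region").count().write.parquet("/out/region_counts")

# With caching, the enriched data is computed once and reused by both branches.
enriched.cache()
enriched.filter(F.col("total") > 100).write.parquet("/out/large_orders_cached")
enriched.groupBy("region").count().write.parquet("/out/region_counts_cached")
enriched.unpersist()
```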

A stage caches data only temporarily. The stage processes a batch of data, caches it, and then passes the cached data to all downstream stages. In a streaming pipeline, the stage processes and caches each new batch of data as it arrives.
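Transformer manages this per-batch caching for you. As a rough analogy in plain Spark Structured Streaming, the pattern resembles persisting each micro-batch inside foreachBatch, fanning it out to multiple destinations, and then releasing it. The source, paths, and columns below are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-cache-demo").getOrCreate()

# Built-in "rate" source generates test rows; it stands in for a streaming origin.
events = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 100)
    .load()
)

def fan_out(batch_df, batch_id):
    # Cache the current micro-batch so both downstream writes reuse the same data.
    batch_df.persist()
    batch_df.filter(F.col("value") % 2 == 0).write.mode("append").parquet("/out/even_rows")
    batch_df.groupBy().count().write.mode("append").parquet("/out/batch_counts")
    # Release the cache once every downstream write for this batch has finished.
    batch_df.unpersist()

query = (
    events.writeStream
    .foreachBatch(fan_out)
    .option("checkpointLocation", "/tmp/cache-demo-checkpoint")
    .start()
)
query.awaitTermination()
```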

Caching data should result in equal or better performance, but results can vary. For example, when a pipeline processes small batches of data on a powerful Transformer machine and only one stage passes data to two downstream stages, caching may have little effect on overall pipeline performance. In contrast, when a stage performs complex processing on large batches of data and passes the results to 10 downstream stages, enabling caching for that stage can improve pipeline performance.

Note: When you enable ludicrous mode to improve pipeline performance, caching can limit the pushdown optimization that ludicrous mode performs.