Cache Levels and Replicas

Spark can cache data implicitly or explicitly as it runs a pipeline. Spark controls implicit caching, such as the caching that occurs during shuffle operations, and stores the implicitly cached data in the Spark default location. At this time, the default location is memory and disk.

Explicit caching occurs when an origin or processor stage has the Cache Data stage property enabled.

The following pipeline properties define how Spark handles explicit caching:
Cache Levels property
Use the Cache Levels advanced pipeline property to configure how data is cached for a pipeline. The Cache Levels property provides the following caching levels:
  • None
  • Disk only
  • Memory only
  • Memory only with serialization
  • Memory and disk
  • Memory and disk with serialization
  • Off heap

For more information about these options, see the Spark documentation.
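
The level names resemble the predefined storage levels in Spark. As a rough sketch only, and assuming that each option corresponds to the similarly named constant in Spark's Scala StorageLevel API, the options can be compared by the flags that Spark defines for those constants. The mapping itself and the names SparkLevel, CACHE_LEVELS, and describe are illustrative assumptions, not part of the product:

```python
from typing import NamedTuple


class SparkLevel(NamedTuple):
    """Illustrative stand-in for the flags of a Spark StorageLevel (hypothetical type)."""
    name: str
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool


# Assumed mapping from the property's labels to Spark's Scala StorageLevel
# constants; the flag values follow the Scala definitions of those constants.
CACHE_LEVELS = {
    "None": SparkLevel("NONE", False, False, False, False),
    "Disk only": SparkLevel("DISK_ONLY", True, False, False, False),
    "Memory only": SparkLevel("MEMORY_ONLY", False, True, False, True),
    "Memory only with serialization": SparkLevel("MEMORY_ONLY_SER", False, True, False, False),
    "Memory and disk": SparkLevel("MEMORY_AND_DISK", True, True, False, True),
    "Memory and disk with serialization": SparkLevel("MEMORY_AND_DISK_SER", True, True, False, False),
    "Off heap": SparkLevel("OFF_HEAP", True, True, True, False),
}


def describe(label: str) -> str:
    """Summarize where a level keeps data and in what form."""
    level = CACHE_LEVELS[label]
    flags = (
        ("memory", level.use_memory),
        ("disk", level.use_disk),
        ("off-heap", level.use_off_heap),
    )
    where = [place for place, enabled in flags if enabled]
    if not where:
        return f"{level.name}: not cached"
    form = "deserialized objects" if level.deserialized else "serialized bytes"
    return f"{level.name}: {', '.join(where)} ({form})"


for label in CACHE_LEVELS:
    print(f"{label} -> {describe(label)}")
```

In general, the serialized levels trade extra CPU time for a smaller memory footprint, and the levels that include disk write partitions that do not fit in memory to disk instead of recomputing them.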

Cache Replicas property
Use the Cache Replicas advanced pipeline property to specify how many replicas of the cached data Spark keeps.

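Under certain conditions, replicas can increase performance. For example, when a node that holds cached data is lost, Spark can read a replica from another node instead of recomputing the data. For more information, see the Spark documentation.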
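
The cost of replicas can be estimated with simple arithmetic. The following minimal sketch assumes that the replica count is passed to Spark as the replication argument of a storage level, and that each replica is a full copy of the cached data. The function names storage_level_args and cached_footprint_bytes are hypothetical and exist only for illustration:

```python
def storage_level_args(use_disk, use_memory, use_off_heap, deserialized, replicas=1):
    """Return the flags in the order that Spark's StorageLevel constructor
    takes them: useDisk, useMemory, useOffHeap, deserialized, replication."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return (use_disk, use_memory, use_off_heap, deserialized, replicas)


def cached_footprint_bytes(dataset_bytes, replicas):
    """Each replica is a full copy, so cluster-wide storage grows linearly."""
    return dataset_bytes * replicas


# Example: the flags of a memory-and-disk level, kept in two replicas.
args = storage_level_args(True, True, False, True, replicas=2)

# Example: a 10 GiB cached dataset with two replicas occupies about 20 GiB.
footprint = cached_footprint_bytes(10 * 1024**3, 2)

print(args)
print(footprint)
```

Because storage grows with every added replica, a higher replica count is worthwhile mainly when recomputing lost data would cost more than the extra memory or disk space.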