Cache Levels and Replicas

Spark can cache data implicitly or explicitly as it runs a pipeline. Spark controls implicit caching, such as the caching that occurs when Spark performs shuffle operations, and stores the cached data in the Spark default location. At this time, the default location is memory and disk.

Explicit caching occurs when an origin or processor stage has the Cache Data stage property enabled.

The following pipeline properties define how Spark handles explicit caching:

Cache Levels property

Use the Cache Levels advanced pipeline property to configure how data is cached for a pipeline. The Cache Levels property provides the following caching levels:

None
Disk only
Memory only
Memory only with serialization
Memory and disk
Memory and disk with serialization
Off heap

For more information about these options, see the Spark documentation.
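As a rough orientation, the caching levels listed above can be read against the flags that Spark's StorageLevel class uses. The sketch below assumes that each Cache Levels label corresponds to the Spark storage level of the same name; this page does not state that mapping. The flag values follow the predefined levels in Spark's Scala API (PySpark always stores data in serialized form, so its flags differ). StorageFlags, CACHE_LEVELS, and describe are illustrative names, not part of any product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageFlags:
    """Illustrative stand-in for the flags of Spark's StorageLevel."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool


# Assumed mapping: Cache Levels label -> (Spark storage level name, flags).
CACHE_LEVELS = {
    "None": ("NONE", StorageFlags(False, False, False, False)),
    "Disk only": ("DISK_ONLY", StorageFlags(True, False, False, False)),
    "Memory only": ("MEMORY_ONLY", StorageFlags(False, True, False, True)),
    "Memory only with serialization":
        ("MEMORY_ONLY_SER", StorageFlags(False, True, False, False)),
    "Memory and disk": ("MEMORY_AND_DISK", StorageFlags(True, True, False, True)),
    "Memory and disk with serialization":
        ("MEMORY_AND_DISK_SER", StorageFlags(True, True, False, False)),
    "Off heap": ("OFF_HEAP", StorageFlags(True, True, True, False)),
}


def describe(label: str) -> str:
    """Summarize where a caching level keeps data and in what form."""
    name, flags = CACHE_LEVELS[label]
    places = [
        place
        for place, enabled in (
            ("memory", flags.use_memory),
            ("disk", flags.use_disk),
            ("off-heap", flags.use_off_heap),
        )
        if enabled
    ]
    if not places:
        return f"{name}: not cached"
    form = "deserialized objects" if flags.deserialized else "serialized bytes"
    return f"{name}: {' + '.join(places)}, {form}"


if __name__ == "__main__":
    for label in CACHE_LEVELS:
        print(f"{label} -> {describe(label)}")
```

In general, serialized levels use less space but cost more CPU time to read, and levels that include disk spill partitions that do not fit in memory instead of recomputing them.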

Cache Replicas property

Use the Cache Replicas advanced pipeline property to determine how many replicas of the cache are kept.

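Under certain conditions, replicas can increase performance. For example, when a node that holds cached data is lost, Spark can read a replica from another node instead of recomputing the lost partitions. Each additional replica also uses additional storage. For more information, see the Spark documentation.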
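The cost side of replication can be estimated with simple arithmetic. The sketch below assumes that every replica is a full copy of each cached partition held on a different node, as Spark's replicated storage levels behave. The function names and the 40 GB figure are hypothetical and used only for illustration.

```python
def replicated_footprint_gb(cached_gb: float, replicas: int) -> float:
    """Total cluster storage used by the cache, counting every replica."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return cached_gb * replicas


def tolerated_node_losses(replicas: int) -> int:
    """Number of node losses a cached partition survives without recomputation,
    assuming each replica lives on a different node."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas - 1


if __name__ == "__main__":
    cached_gb = 40.0  # hypothetical size of the cached data
    for replicas in (1, 2, 3):
        total = replicated_footprint_gb(cached_gb, replicas)
        losses = tolerated_node_losses(replicas)
        print(f"replicas={replicas}: {total} GB stored, survives {losses} node loss(es)")
```

Replicas are usually worth the extra storage only when recomputing a lost partition costs more than keeping a second copy.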