Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/cache in Spark is lazy and does not truncate the lineage, whereas checkpoint is eager (by default) and truncates the lineage.
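To make the lazy/eager and lineage distinction concrete, here is a toy model in plain Python. It is not Spark code and not how Spark implements either operation: the `Dataset` class, its `evaluations` counter, and `lineage_depth` are hypothetical names invented for this sketch. It only mimics the two behaviours described above: `persist` marks a node for caching without computing anything and keeps its parent, while `checkpoint` computes immediately and returns a node with no parent.

```python
class Dataset:
    """Toy stand-in for a Spark dataset: one node in a lineage chain."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # function that produces this node's rows
        self.parent = parent      # upstream node, i.e. the lineage
        self._persist = False     # set by persist(); honoured on first collect()
        self._cached = None
        self.evaluations = 0      # how many times this node was actually computed

    def map(self, fn):
        # A transformation: builds a new node whose lineage points back to self.
        return Dataset(lambda: [fn(x) for x in self.collect()], parent=self)

    def persist(self):
        # Lazy: only records the intent to cache. Nothing runs, lineage is kept.
        self._persist = True
        return self

    def checkpoint(self):
        # Eager: computes right away and returns a node with no parent,
        # so the lineage is cut at this point.
        rows = self.collect()
        return Dataset(lambda: list(rows), parent=None)

    def collect(self):
        # An action: triggers computation unless a cached result exists.
        if self._cached is not None:
            return self._cached
        self.evaluations += 1
        rows = self._compute()
        if self._persist:
            self._cached = rows
        return rows

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


source = Dataset(lambda: [1, 2, 3])

# persist: nothing is computed until the first action; the parent link remains.
persisted = source.map(lambda x: x * 2).persist()
evaluations_before_action = persisted.evaluations
persisted.collect()
persisted.collect()  # second action is served from the cache

# checkpoint: computed immediately; the returned node has no parent.
mapped = source.map(lambda x: x * 2)
checkpointed = mapped.checkpoint()
```

In this sketch, `persisted` still has a lineage depth of 1 and is computed only once across both `collect()` calls, while `checkpointed` has a lineage depth of 0 and its upstream node `mapped` was already evaluated at the moment `checkpoint()` returned.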
Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.