Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.
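The difference between lazy caching and eager, lineage-truncating checkpointing can be illustrated without a cluster. The sketch below is a toy model in plain Python, not Spark's implementation: `ToyDataset`, `make_chain`, and `lineage_depth` are hypothetical names invented for this illustration, and only the default eager checkpoint behaviour is modelled. In real PySpark code the corresponding calls would be `DataFrame.persist()` and `DataFrame.checkpoint()`, with a checkpoint directory set through `SparkContext.setCheckpointDir` beforehand.

```python
class ToyDataset:
    """Toy stand-in for a Spark DataFrame (hypothetical, for illustration only)."""

    def __init__(self, compute, parent=None):
        self._compute = compute    # zero-argument function producing the rows
        self.parent = parent       # upstream dataset, i.e. the lineage link
        self._want_cache = False
        self._cached = None

    def map(self, fn):
        # A transformation: builds a child dataset that remembers its parent.
        return ToyDataset(lambda: [fn(x) for x in self.collect()], parent=self)

    def persist(self):
        # Lazy: only marks the dataset; nothing is computed here.
        self._want_cache = True
        return self

    def checkpoint(self):
        # Eager: computes the rows now and returns a dataset with no parent,
        # so the lineage is truncated.
        rows = self.collect()
        return ToyDataset(lambda: rows)

    def collect(self):
        # An action: serves from the cache if one was filled earlier.
        if self._cached is not None:
            return self._cached
        rows = self._compute()
        if self._want_cache:
            self._cached = rows
        return rows

    def lineage_depth(self):
        # Number of ancestors that would have to be recomputed on failure.
        depth, node = 0, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


def make_chain():
    """Build source -> map, recording every time the source is loaded."""
    calls = []

    def load():
        calls.append("load")
        return [1, 2, 3]

    return ToyDataset(load).map(lambda x: x * 2), calls


# persist: no work until an action runs, and the lineage is kept.
persisted, p_calls = make_chain()
persisted.persist()
loads_after_persist = len(p_calls)
persisted.collect()
persisted.collect()
loads_after_collects = len(p_calls)
persist_depth = persisted.lineage_depth()

# checkpoint: work happens immediately, and the lineage is dropped.
source, c_calls = make_chain()
checkpointed = source.checkpoint()
loads_after_checkpoint = len(c_calls)
checkpoint_depth = checkpointed.lineage_depth()
```

In this toy model, `persist()` triggers no load until the first `collect()`, and the persisted dataset still points at its parent, whereas `checkpoint()` loads the source immediately and returns a dataset with no ancestors.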
Generally speaking, DataFrame.persist has better performance than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.
Probability to Lose All Money
A few days ago I found someone asking an interview question on mitbbs. The question is as follows. A gambler plays a fair game and bets 1 dollar each time. If he loses all his money, the game stops. Suppose he has 10 dollars and is only allowed to play …