Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Persist and Checkpoint DataFrames in Spark

Persist vs Checkpoint

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

  1. Persist/cache in Spark is lazy and does not truncate the lineage, whereas checkpoint is eager (by default) and truncates the lineage.

  2. Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.

Probability to Lose All Money

A few days ago I found someone asking an interview question on mitbbs. The question is as follows. A gambler plays a fair game and bets 1 dollar each time. If he loses all his money, the game stops. Suppose he has 10 dollars and is only allowed to play …