Ben Chuanlong Du's Blog

And let it direct your passion with reason.

Persist and Checkpoint DataFrames in Spark

Persist vs Checkpoint

Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.

  1. Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.

  2. Generally speaking, DataFrame.persist performs better than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.
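As a rough illustration of the two calls compared above, here is a minimal sketch. The helper name materialize, the demo function, and the checkpoint directory path are assumptions made for this example, not part of Spark or of the original post. The demo requires pyspark to be installed.

```python
def materialize(df, truncate_lineage=False, eager=True):
    """Return a reusable version of `df` (hypothetical helper).

    persist() is lazy and keeps the lineage; checkpoint() saves the data
    to the checkpoint directory and returns a DataFrame whose lineage is
    truncated.
    """
    if truncate_lineage:
        # Needs spark.sparkContext.setCheckpointDir(...) to be called first.
        return df.checkpoint(eager=eager)
    return df.persist()


def demo(checkpoint_dir="/tmp/spark-checkpoints"):  # path is an assumption
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    df = spark.range(10)

    cached = materialize(df)  # lazy: nothing is computed yet
    cached.count()            # the first action fills the cache
    stable = materialize(df, truncate_lineage=True)  # eager: computed now
    return cached, stable
```

Passing eager=False to checkpoint defers the work to the next action, which makes it behave more like persist in terms of timing while still truncating the lineage.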

Types of Joins of Spark DataFrames

Comments

  1. It is suggested that you always pass a list of columns to the parameter on, even if there is only one column to join on.

  2. None in a numeric column of a pandas DataFrame is stored as NaN, so it reaches Spark as NaN instead of null!

  3. Spark supports the following join types (names on the same line are aliases of each other):

    • inner (default)
    • cross
    • outer, full, fullouter, full_outer
    • left, leftouter, left_outer
    • right, rightouter, right_outer
    • semi, leftsemi, left_semi
    • anti, leftanti, left_anti
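The first and third comments can be combined in a small sketch. The wrapper join_on and the sample data in demo are hypothetical names made up for illustration. Only DataFrame.join with its on and how parameters comes from Spark itself. The demo requires pyspark to be installed.

```python
def join_on(left, right, on, how="inner"):
    """Join two DataFrames, always passing `on` as a list (hypothetical helper)."""
    # Wrap a single column name so that join always receives a list.
    columns = [on] if isinstance(on, str) else list(on)
    return left.join(right, on=columns, how=how)


def demo():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    users = spark.createDataFrame([(1, "ann"), (2, "bob")], ["id", "name"])
    orders = spark.createDataFrame([(1, 9.5)], ["id", "amount"])
    # left_anti keeps the rows of users that have no match in orders.
    return join_on(users, orders, "id", how="left_anti")
```

Any of the join type names listed above can be passed as how.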

Union RDDs in Spark

No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+. Call distinct() on the result if you need duplicates removed.

  1. Union 2 RDDs.

    df1.union(df2)
    # or for old-fashioned RDD
    rdd1.union(rdd2)
    
  2. Union multiple RDDs.

    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])  # SparkSession has no union method
    # or for old-fashioned RDD
    rdd …
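Because union keeps duplicates, a helper that unions several inputs and optionally deduplicates the result may be convenient. The sketch below is one possible way to write it. The name union_all and its deduplicate flag are assumptions for this example. It relies only on the union and distinct methods that both DataFrames and RDDs provide.

```python
from functools import reduce


def union_all(frames, deduplicate=False):
    """Union any number of DataFrames (or RDDs) pairwise (hypothetical helper)."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one DataFrame or RDD")
    # Fold union over the inputs from left to right; duplicates are kept.
    combined = reduce(lambda left, right: left.union(right), frames)
    # distinct() triggers a shuffle, so it is applied only on request.
    return combined.distinct() if deduplicate else combined
```

For example, union_all([df1, df2, df3], deduplicate=True) would return the distinct rows found across the three DataFrames.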

Union DataFrames in Spark

Comment

  1. union relies on column order rather than column names. This is the same as in SQL. For columns whose types don't match, the super type is used. However, this is really dangerous if you are not careful. It is suggested that you define a function called unionByName to handle this (Spark 2.3+ also ships a built-in DataFrame.unionByName).
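A by-name union along the lines suggested above could be sketched like this. The function name union_by_name is hypothetical, and the strict check that both inputs have the same set of columns is a design choice made for this example rather than something the original post specifies.

```python
def union_by_name(df1, df2):
    """Union two DataFrames by column name instead of position (sketch)."""
    if set(df1.columns) != set(df2.columns):
        raise ValueError(
            f"column mismatch: {sorted(df1.columns)} vs {sorted(df2.columns)}"
        )
    # Reorder df2's columns to match df1 before the positional union.
    return df1.union(df2.select(*df1.columns))
```

Failing early on mismatched columns avoids the silent mix-up that a purely positional union would produce.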