Persist vs Checkpoint
Spark Internals - 6-CacheAndCheckpoint.md has a good explanation of persist vs checkpoint.
Persist/Cache in Spark is lazy and doesn't truncate the lineage while checkpoint is eager (by default) and truncates the lineage.
Generally speaking, DataFrame.persist has better performance than DataFrame.checkpoint. However, DataFrame.checkpoint is more robust and is preferred in any of the following situations.
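The trade-off described above can be sketched with a small helper. The function name materialize and its truncate_lineage flag are hypothetical, made up for illustration. The sketch assumes df is a PySpark DataFrame and that a checkpoint directory was already configured with spark.sparkContext.setCheckpointDir before checkpoint is called.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def materialize(df: "DataFrame", truncate_lineage: bool = False) -> "DataFrame":
    """Hypothetical helper: pick persist or checkpoint for a DataFrame.

    persist() is lazy and keeps the lineage. checkpoint(eager=True) computes
    the data right away, saves it under the checkpoint directory, and returns
    a DataFrame whose lineage is truncated.
    """
    if truncate_lineage:
        # Assumes SparkContext.setCheckpointDir(...) was called beforehand.
        return df.checkpoint(eager=True)
    # Only marks df for caching; nothing is computed until an action runs.
    return df.persist()
```

A typical call would be materialize(df) for a DataFrame that is reused a few times, and materialize(df, truncate_lineage=True) when the lineage itself has become the problem, for example in a long iterative job.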
Types of Joins of Spark DataFrames
Comments
It is suggested that you always pass a list of columns to the parameter on, even if there is only one column for joining.
None in a pandas DataFrame is converted to NaN instead of null!
Spark allows the following join types:
inner (default)
cross
outer
full, fullouter, full_outer
left, leftouter, left_outer
right, rightouter, right_outer
semi, leftsemi, left_semi
anti, leftanti, left_anti
Inner Join of Spark DataFrames
Tips and Traps
Select only needed columns before joining.
Rename joining column names to be identical (if different) before joining.
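The two tips above can be combined in one small function. This is only a sketch: inner_join_trimmed and its parameters are hypothetical names, and it assumes both inputs are PySpark DataFrames joined on a single key column.

```python
from typing import TYPE_CHECKING, Sequence

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def inner_join_trimmed(
    left: "DataFrame",
    right: "DataFrame",
    left_key: str,
    right_key: str,
    left_cols: Sequence[str],
    right_cols: Sequence[str],
) -> "DataFrame":
    """Hypothetical helper: trim columns, align the key name, then inner join."""
    # Tip 1: keep only the key and the columns that are actually needed.
    left_small = left.select(left_key, *left_cols)
    right_small = right.select(right_key, *right_cols)
    # Tip 2: make the key column names identical before joining.
    if right_key != left_key:
        right_small = right_small.withColumnRenamed(right_key, left_key)
    # Joining on a list of names keeps a single copy of the key column.
    return left_small.join(right_small, on=[left_key], how="inner")
```

For example, inner_join_trimmed(orders, users, "user_id", "id", ["amount"], ["country"]) would return the columns user_id, amount, and country; the DataFrame and column names here are made up.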
Union RDDs in Spark
No deduplication is done (to be efficient) when unioning RDDs/DataFrames in Spark 2.1.0+.
- Union 2 RDDs.
    df1.union(df2)
    # or for old-fashioned RDDs
    rdd1.union(rdd2)
- Union multiple RDDs.
    # fold DataFrame.union over the list of DataFrames
    from functools import reduce
    from pyspark.sql import DataFrame
    df = reduce(DataFrame.union, [df1, df2, df3])
    # or for old-fashioned RDDs
    rdd …
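Since union does not deduplicate, a reusable helper can make that choice explicit. The sketch below is one possible way to do it: union_all and its deduplicate flag are hypothetical names, and the function relies only on the union and distinct methods that both DataFrames and RDDs provide.

```python
from functools import reduce
from typing import Iterable


def union_all(frames: Iterable, deduplicate: bool = False):
    """Hypothetical helper: union any number of DataFrames (or RDDs).

    Duplicates are kept unless deduplicate=True, in which case distinct()
    is applied once to the combined result.
    """
    frames = list(frames)
    if not frames:
        raise ValueError("union_all needs at least one DataFrame or RDD")
    # Pairwise union, left to right. For DataFrames, union matches columns
    # by position, so the inputs are assumed to share the same column order.
    combined = reduce(lambda acc, nxt: acc.union(nxt), frames)
    return combined.distinct() if deduplicate else combined
```

Calling union_all([df1, df2, df3]) keeps every row, while union_all([df1, df2, df3], deduplicate=True) pays for one extra distinct pass over the combined data.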