Comments
- It is suggested that you always pass a list of columns to the parameter `on`, even if there is only one column for joining.
- `None` in a pandas DataFrame is converted to `NaN` instead of `null`!
- Spark allows using the following join types:
    - `inner` (default)
    - `cross`
    - `outer`, `full`, `fullouter`, `full_outer`
    - `left`, `leftouter`, `left_outer`
    - `right`, `rightouter`, `right_outer`
    - `semi`, `leftsemi`, `left_semi`
    - `anti`, `leftanti`, `left_anti`
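As a minimal sketch of these points (the DataFrame names and columns below are made-up assumptions), a list of columns is passed to `on` even though there is only one joining column, and the join type is selected via the `how` parameter:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-example").getOrCreate()

# Two small illustrative DataFrames (names and columns are assumptions for this sketch).
emp = spark.createDataFrame([(1, "Ben"), (2, "Dan")], ["id", "name"])
dept = spark.createDataFrame([(1, "IT"), (3, "HR")], ["id", "dept"])

# Pass a list of columns to `on`, even though there is only one joining column.
# "inner" is the default; any of the join types listed above can be passed to `how`.
emp.join(dept, on=["id"], how="inner").show()
```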
Inner Join of Spark DataFrames
Tips and Traps
- Select only the needed columns before joining.
- Rename the joining columns to identical names (if they differ) before joining, as sketched below.
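A minimal sketch of both tips, assuming hypothetical DataFrames `orders` and `customers` whose joining columns are named differently (`cust_id` vs `id`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-tips").getOrCreate()

# Hypothetical DataFrames; the names and columns are assumptions for this sketch.
orders = spark.createDataFrame([(100, 1, 9.9), (101, 2, 5.0)], ["order_id", "cust_id", "amount"])
customers = spark.createDataFrame([(1, "Ben", "US"), (2, "Dan", "CA")], ["id", "name", "country"])

# 1. Select only the columns you actually need before joining.
orders_small = orders.select("cust_id", "amount")
customers_small = customers.select("id", "name")

# 2. Rename the joining column so both sides use the same name,
#    which lets you join on a list of column names and avoids duplicate columns.
customers_small = customers_small.withColumnRenamed("id", "cust_id")

orders_small.join(customers_small, on=["cust_id"], how="inner").show()
```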
Union RDDs in Spark
No deduplication is done (for efficiency) when unioning RDDs/DataFrames in Spark 2.1.0+.
-   Union 2 RDDs.

    ```python
    df1.union(df2)
    # or for an old-fashioned RDD
    rdd1.union(rdd_2)
    ```

-   Union multiple RDDs.

    ```python
    from functools import reduce
    from pyspark.sql import DataFrame

    # SparkSession has no `union` method, so fold pairwise DataFrame.union calls.
    df = reduce(DataFrame.union, [df1, df2, df3])

    # or for old-fashioned RDDs (spark is a SparkSession object)
    rdd = spark.sparkContext.union([rdd1, rdd2, rdd3])
    ```
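Since no deduplication is done, `distinct()` (or `dropDuplicates()`) can be chained after the union when duplicates are not wanted. A minimal sketch, assuming two made-up DataFrames with an overlapping row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("union-example").getOrCreate()

# Hypothetical DataFrames sharing the row (2, "b"); names are assumptions for this sketch.
df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "val"])

df1.union(df2).show()             # the duplicate (2, "b") row is kept
df1.union(df2).distinct().show()  # chain distinct() if deduplication is needed
```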