
Null Values in Inner Join of Spark DataFrames

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark-Null")
    .getOrCreate()
import spark.implicits._

Comment

  1. Null values are excluded from an inner join, even if both tables have null values in the join column.

  2. It is almost always a good idea to filter out null values in the join columns before joining, no matter whether it is an inner join or an outer join (of course, if the rows containing null matter in your use case, you have to union those records back afterwards). Spark (at least in Spark 2.3 and older) does not filter out join keys/columns with null values before the join, even for an INNER join (even though the null-keyed rows are dropped from the result). This means that if a join key/column has lots of null values, they all get shuffled to the same node in a SortMergeJoin, which can cause a serious data-skew issue.

val left = Seq(
    ("bob", "2015-01-13", 4), 
    (null, "2015-04-23",10)
).toDF("name","date","duration")
left.show()
val right = Seq(
    (null, 100),
    ("bob", 23)
).toDF("name","upload")
right.show()
left.join(right, Seq("name")).show
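Following point 2 above, a minimal sketch (reusing the `left` and `right` DataFrames defined above) of filtering out null join keys before the inner join might look like this:

```scala
import org.apache.spark.sql.functions.col

// Drop rows whose join key is null on both sides. An inner join would
// discard them anyway, but filtering first keeps the null keys from
// being shuffled onto a single partition in a SortMergeJoin.
val leftNonNull = left.filter(col("name").isNotNull)
val rightNonNull = right.filter(col("name").isNotNull)

leftNonNull.join(rightNonNull, Seq("name")).show()

// For an outer join where the null-keyed rows still matter, join the
// non-null parts as above and union the null-keyed rows back afterwards.
```

The result is the same as the plain inner join (only the "bob" row survives), but the null-keyed rows never enter the shuffle.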