Ben Chuanlong Du's Blog

It is never too late to learn.

Null Values in Inner Join of Spark Dataframes

In [ ]:
%%classpath add mvn
org.apache.spark spark-core_2.11 2.3.1
org.apache.spark spark-sql_2.11 2.3.1
org.apache.spark spark-hive_2.11 2.3.1
In [ ]:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
    .master("local[2]")
    .appName("Spark-Null")
    .getOrCreate()
import spark.implicits._

Comments

  1. Rows with a null value in the joining column are excluded from an inner join, even if both tables have null values in that column. The example below demonstrates this.

  2. It is almost always a good idea to filter out null values in the joining columns before joining, no matter whether it is an inner join or an outer join (of course, if the rows containing nulls matter in your use case, you have to union those records back afterwards). Spark (at least in Spark 2.3 and older) is not smart enough to filter out joining keys/columns with null values before the join, even for an INNER join (even though the null-keyed rows are dropped after the inner join). This means that if a joining key/column has lots of null values, all of those rows get shuffled to the same node in a SortMergeJoin, which can cause a serious data skew issue. A sketch of filtering before the join is given after the join example below.

In [ ]:
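// Left table: contains a row whose join key ("name") is null.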
val left = Seq(
    ("bob", "2015-01-13", 4), 
    (null, "2015-04-23",10)
).toDF("name","date","duration")
left.show()
In [ ]:
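// Right table: also contains a row with a null join key.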
val right = Seq(
    (null, 100),
    ("bob", 23)
).toDF("name","upload")
right.show()
In [ ]:
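// Inner join on "name": the null-keyed rows on either side never match,
// so only the "bob" row appears in the result.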
left.join(right, Seq("name")).show
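
Following up on comment 2, below is a minimal sketch of filtering out null join keys before the inner join, using the left and right DataFrames defined above. Only the column name "name" comes from the example; the variable names leftNonNull and rightNonNull are illustrative.

In [ ]:
// Drop rows whose join key is null before joining: they can never match
// in an inner join, and removing them keeps the null keys out of the shuffle.
val leftNonNull = left.filter($"name".isNotNull)
val rightNonNull = right.filter($"name".isNotNull)
leftNonNull.join(rightNonNull, Seq("name")).show

If the null-keyed rows matter for your use case, keep them aside with left.filter($"name".isNull) and union them back after the join, padding the missing columns with null literals so the schemas line up.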
