Tips and Traps¶
Select only needed columns before joining.
Rename joining column names to be identical (if different) before joining.
import pandas as pd
import findspark
findspark.init("/opt/spark-3.1.1-bin-hadoop3.2/")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = SparkSession.builder.appName("Join").enableHiveSupport().getOrCreate()
Same Names in Both Tables¶
left = spark.createDataFrame(
pd.DataFrame(
data=(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)),
columns=("name", "date", "duration"),
)
)
left.show()
right = spark.createDataFrame(
pd.DataFrame(data=(("alice", 100), ("bob", 23)), columns=("name", "upload"))
)
right.show()
Duplicate columns happens if you use an expression as join condition!
left.join(right, left["name"] == right["name"]).show()
left.join(right, left.name == right.name).show()
Using (a list of) string names can avoid duplicate columns.
left.join(right, ["name"]).show()
left.join(right, "name").show()
Same Columns Not in Join¶
left = spark.createDataFrame(
pd.DataFrame(
data=(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)),
columns=("name", "date", "duration"),
)
)
left.show()
right = spark.createDataFrame(
pd.DataFrame(
data=(("alice", 100, 1), ("bob", 23, 2)), columns=("name", "upload", "duration")
)
)
right.show()
Join the 2 DataFrame by the name
column.
Duplicate columns happen as the duration
column is in both DataFrame.
left.join(right, "name").show()
Select via string names works on non duplicate columns. Exception will be throw if you select a duplicate column using string names.
left.alias("l").join(right.alias("r"), "name").select("name", "date").show()
Select using column
objects.
left.join(right, "name").select(left["name"], left["date"], left["duration"]).show()
Using table alias is probably the most convenient way (in syntax). Similar to SQL, you don't have to specify table when there's no ambiguition.
left.alias("l").join(right.alias("r"), "name").select(
"name", "date", "l.duration", "upload"
).show()
Star in Select¶
Notice that *
can be used to select all columns from a table.
left.alias("l").join(right.alias("r"), "name").select("l.*").show()
Different Names for Joining¶
If you want to do inner join only, it is suggested that you rename the columns to join to have the same names so that
- minimal number of columns
- no duplicate columns
left = spark.createDataFrame(
pd.DataFrame(
data=(("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)),
columns=("name", "date", "duration"),
)
)
left.show()
right = spark.createDataFrame(
pd.DataFrame(
data=(("alice", 100, 1), ("bob", 23, 2)), columns=("nm", "upload", "duration")
)
)
right.show()
left.join(right, left["name"] == right["nm"]).show()
left.join(right.withColumnRenamed("nm", "name"), ["name"]).show()