Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comments
After sorting, rows in a DataFrame are ordered by partition ID, and within each partition the rows are sorted. This property can be leveraged to implement a global ranking of rows. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, notice that multi-layer ranking is often more efficient than a global ranking in big data applications.
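The idea behind deriving a global rank from this property can be sketched in pure Python, with plain lists standing in for partitions (the partition contents below are hypothetical, for illustration only): since partitions are ordered by ID and rows are sorted within each partition, the global rank of a row is the total size of all earlier partitions plus its local rank.

```python
from itertools import accumulate

# Hypothetical partition contents after a Spark sort: partitions are
# ordered by partition ID, and rows are sorted within each partition.
partitions = [[1, 3, 5], [6, 8], [9, 10, 12]]

# Offset of each partition = total number of rows in earlier partitions.
sizes = [len(p) for p in partitions]
offsets = [0] + list(accumulate(sizes))[:-1]

# Global rank (0-based) = partition offset + local rank within the partition.
global_ranks = {
    row: offset + local_rank
    for partition, offset in zip(partitions, offsets)
    for local_rank, row in enumerate(partition)
}
print(global_ranks)  # → {1: 0, 3: 1, 5: 2, 6: 3, 8: 4, 9: 5, 10: 6, 12: 7}
```

In Spark, the per-partition sizes would come from counting rows per `spark_partition_id()`, and the local rank from ranking within each partition; the arithmetic is the same as above.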
import findspark
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark_Sorting").enableHiveSupport().getOrCreate()
)

import pandas as pd

df_p = pd.DataFrame(
    [
        ("Ben", "Du", 1),
        ("Ben", "Du", 2),
        ("Ken", "Xu", 1),
        ("Ken", "Xu", 9),
        ("Ben", "Tu", 3),
        ("Ben", "Tu", 4),
    ],
    columns=["first_name", "last_name", "id"],
)
df_p
df = spark.createDataFrame(df_p)
df.show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ben| Du| 1|
| Ben| Du| 2|
| Ken| Xu| 1|
| Ken| Xu| 9|
| Ben| Tu| 3|
| Ben| Tu| 4|
+----------+---------+---+
df.orderBy(["first_name", "last_name"]).show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ben| Du| 1|
| Ben| Du| 2|
| Ben| Tu| 4|
| Ben| Tu| 3|
| Ken| Xu| 9|
| Ken| Xu| 1|
+----------+---------+---+
Note: the ascending keyword argument below cannot be omitted!
df.orderBy(["first_name", "last_name"], ascending=[False, False]).show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ken| Xu| 9|
| Ken| Xu| 1|
| Ben| Tu| 3|
| Ben| Tu| 4|
| Ben| Du| 1|
| Ben| Du| 2|
+----------+---------+---+
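For comparison, pandas uses the same ascending= keyword convention in sort_values. Below is a self-contained sketch (re-defining df_p so it runs on its own) that reproduces the descending sort above:

```python
import pandas as pd

df_p = pd.DataFrame(
    [
        ("Ben", "Du", 1),
        ("Ben", "Du", 2),
        ("Ken", "Xu", 1),
        ("Ken", "Xu", 9),
        ("Ben", "Tu", 3),
        ("Ben", "Tu", 4),
    ],
    columns=["first_name", "last_name", "id"],
)

# Sort both key columns in descending order, mirroring
# df.orderBy(["first_name", "last_name"], ascending=[False, False]).
df_sorted = df_p.sort_values(["first_name", "last_name"], ascending=[False, False])
print(df_sorted)
```

As in Spark, rows with equal keys ("Ben", "Tu") may appear in any relative order, since id is not part of the sort key.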