Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comments
After sorting, rows in a DataFrame are ordered by partition ID, and within each partition the rows are sorted. This property can be leveraged to implement a global ranking of rows. For more details, please refer to Computing global rank of a row in a DataFrame with Spark SQL. However, notice that multi-layer ranking is often more efficient than a global ranking in big data applications.
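The idea behind deriving a global rank from this property can be sketched in pure Python, with plain lists standing in for partitions (the partition contents below are hypothetical, for illustration only): since partitions are ordered by ID and rows are sorted within each partition, the global rank of a row is the total size of all earlier partitions plus its local rank.

```python
from itertools import accumulate

# Hypothetical partition contents after a Spark sort: partitions are
# ordered by partition ID, and rows are sorted within each partition.
partitions = [[1, 3, 5], [6, 8], [9, 10, 12]]

# Offset of each partition = total number of rows in earlier partitions.
sizes = [len(p) for p in partitions]
offsets = [0] + list(accumulate(sizes))[:-1]

# Global rank (0-based) = partition offset + local rank within the partition.
global_ranks = {
    row: offset + local_rank
    for partition, offset in zip(partitions, offsets)
    for local_rank, row in enumerate(partition)
}
print(global_ranks)  # → {1: 0, 3: 1, 5: 2, 6: 3, 8: 4, 9: 5, 10: 6, 12: 7}
```

In Spark, the per-partition sizes would come from counting rows per `spark_partition_id()`, and the local rank from ranking within each partition; the arithmetic is the same as above.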
import findspark
findspark.init("/opt/spark")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
from pyspark.sql.types import StructType
spark = (
    SparkSession.builder.appName("PySpark_Sorting").enableHiveSupport().getOrCreate()
)

import pandas as pd

df_p = pd.DataFrame(
    [
        ("Ben", "Du", 1),
        ("Ben", "Du", 2),
        ("Ken", "Xu", 1),
        ("Ken", "Xu", 9),
        ("Ben", "Tu", 3),
        ("Ben", "Tu", 4),
    ],
    columns=["first_name", "last_name", "id"],
)
df_p
df = spark.createDataFrame(df_p)
df.show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ben| Du| 1|
| Ben| Du| 2|
| Ken| Xu| 1|
| Ken| Xu| 9|
| Ben| Tu| 3|
| Ben| Tu| 4|
+----------+---------+---+
df.orderBy(["first_name", "last_name"]).show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ben| Du| 1|
| Ben| Du| 2|
| Ben| Tu| 4|
| Ben| Tu| 3|
| Ken| Xu| 9|
| Ken| Xu| 1|
+----------+---------+---+
Note: the ascending keyword argument below cannot be omitted!
df.orderBy(["first_name", "last_name"], ascending=[False, False]).show()
+----------+---------+---+
|first_name|last_name| id|
+----------+---------+---+
| Ken| Xu| 9|
| Ken| Xu| 1|
| Ben| Tu| 3|
| Ben| Tu| 4|
| Ben| Du| 1|
| Ben| Du| 2|
+----------+---------+---+
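For comparison, pandas uses the same ascending= keyword convention in sort_values. Below is a self-contained sketch (re-defining df_p so it runs on its own) that reproduces the descending sort above:

```python
import pandas as pd

df_p = pd.DataFrame(
    [
        ("Ben", "Du", 1),
        ("Ben", "Du", 2),
        ("Ken", "Xu", 1),
        ("Ken", "Xu", 9),
        ("Ben", "Tu", 3),
        ("Ben", "Tu", 4),
    ],
    columns=["first_name", "last_name", "id"],
)

# Sort both key columns in descending order, mirroring
# df.orderBy(["first_name", "last_name"], ascending=[False, False]).
df_sorted = df_p.sort_values(["first_name", "last_name"], ascending=[False, False])
print(df_sorted)
```

As in Spark, rows with equal keys ("Ben", "Tu") may appear in any relative order, since id is not part of the sort key.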