Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
import pandas as pd
import socket
import findspark
findspark.init("/opt/spark-3.2.0-bin-hadoop3.2")
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("PySpark").enableHiveSupport().getOrCreate()

df = spark.createDataFrame(
    pd.DataFrame(
        data=(
            ("Ben", "Du", 1),
            ("Ben", "Du", 2),
            ("Ben", "Tu", 3),
            ("Ben", "Tu", 4),
            ("Ken", "Xu", 1),
            ("Ken", "Xu", 9),
        ),
        columns=("fname", "lname", "score"),
    )
)
df.show()
+-----+-----+-----+
|fname|lname|score|
+-----+-----+-----+
| Ben| Du| 1|
| Ben| Du| 2|
| Ben| Tu| 3|
| Ben| Tu| 4|
| Ken| Xu| 1|
| Ken| Xu| 9|
+-----+-----+-----+
DataFrame.stat.approxQuantile
Notice that it returns a list of floats (an `Array[Double]` in the Scala API), one element per requested probability. The third argument is the relative target precision (`relativeError`); a larger value makes the computation cheaper but less accurate, which is why the result changes with it below.
df.stat.approxQuantile("score", [0.5], 0.1)
[2.0]
df.stat.approxQuantile("score", [0.5], 0.001)
[2.0]
df.stat.approxQuantile("score", [0.5], 0.5)
[1.0]
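For comparison, here is a minimal sketch (using pandas rather than Spark, so no cluster is needed) computing the exact median of the same `score` column. Note that `Series.quantile` interpolates between the two middle values by default, so it returns 2.5 here, whereas `approxQuantile` above returns an approximate value drawn from the sketch of the data.

```python
import pandas as pd

# Same data as the Spark DataFrame above.
pdf = pd.DataFrame(
    data=(
        ("Ben", "Du", 1),
        ("Ben", "Du", 2),
        ("Ben", "Tu", 3),
        ("Ben", "Tu", 4),
        ("Ken", "Xu", 1),
        ("Ken", "Xu", 9),
    ),
    columns=("fname", "lname", "score"),
)

# Exact median: sorted scores are 1, 1, 2, 3, 4, 9,
# and pandas interpolates between the two middle values (2 and 3).
exact_median = pdf["score"].quantile(0.5)
print(exact_median)  # 2.5
```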