Ben Chuanlong Du's Blog


Hands on GroupBy of Polars DataFrame in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

In [1]:
import itertools as it
import polars as pl
In [2]:
df = pl.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
    }
)
df
Out[2]:
shape: (5, 3)
id color shape
i64 str str
0 "red" "square"
1 "green" "triangle"
2 "green" "square"
3 "red" "triangle"
4 "red" "square"
In [11]:
df.groupby("color", maintain_order=True).agg(pl.col("id"))
Out[11]:
shape: (2, 2)
color id
str list[i64]
"red" [0, 3, 4]
"green" [1, 2]
In [17]:
df.groupby("color", maintain_order=True).agg(pl.col("id").first())
Out[17]:
shape: (2, 2)
color id
str i64
"red" 0
"green" 1
In [4]:
def update_frame(frame):
    # Multiply the "id" in the first row of the group by 1000 (in place).
    frame[0, "id"] = frame[0, "id"] * 1000
    return frame
In [5]:
df.groupby("color").apply(update_frame)
Out[5]:
shape: (5, 3)
id color shape
i64 str str
0 "red" "square"
3 "red" "triangle"
4 "red" "square"
1000 "green" "triangle"
2 "green" "square"

GroupBy + Aggregation

In [6]:
df.groupby("color").agg(pl.count().alias("n"))
Out[6]:
shape: (2, 2)
color n
str u32
"green" 2
"red" 3
In [9]:
pl.DataFrame(
    data=it.combinations(range(52), 4),
    orient="row"
).with_row_count().groupby([
    "column_0", 
    "column_1", 
    "column_2", 
]).agg(pl.col("row_nr").min()).sort([
    "column_0", 
    "column_1", 
    "column_2", 
])
Out[9]:
shape: (20825, 4)
column_0 column_1 column_2 row_nr
i64 i64 i64 u32
0 1 2 0
0 1 3 49
0 1 4 97
0 1 5 144
0 1 6 190
0 1 7 235
0 1 8 279
0 1 9 322
0 1 10 364
0 1 11 405
0 1 12 445
0 1 13 484
... ... ... ...
45 48 50 270708
45 49 50 270709
46 47 48 270710
46 47 49 270713
46 47 50 270715
46 48 49 270716
46 48 50 270718
46 49 50 270719
47 48 49 270720
47 48 50 270722
47 49 50 270723
48 49 50 270724
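The cell above finds, for every prefix `(column_0, column_1, column_2)` of the 4-element combinations of `range(52)`, the index of the first row with that prefix. The numbers can be cross-checked in plain Python without Polars. This is only a sketch for verification: the dictionary `first_row` is a name made up here, and it plays the role of the `min` aggregation.

```python
import itertools as it
from math import comb

# Map each 3-element prefix to the index of the first combination starting with it.
first_row = {}
for row_nr, combo in enumerate(it.combinations(range(52), 4)):
    # setdefault keeps the first (smallest) index seen for a prefix.
    first_row.setdefault(combo[:3], row_nr)

# A prefix needs a larger fourth element, so its values come from range(51).
print(len(first_row), comb(51, 3))
print(first_row[(0, 1, 3)], first_row[(48, 49, 50)])
```

The group count matches the 20825 rows of the output, and the prefix `(0, 1, 3)` starts at row 49 because the prefix `(0, 1, 2)` can be completed by the 49 values 3 through 51.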

GroupBy as An Iterable

In [11]:
pl.Series(
    (g, frame.shape[0])
    for g, frame in df.groupby("color")
)
Out[11]:
shape: (2,)
object
('red', 3)
('green', 2)