
Hands on the Polars Library in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

  1. polars.DataFrame.unique and polars.Series.unique do not maintain the original order by default. To maintain the original order, pass maintain_order=True (see the sketch below).
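
A minimal sketch on a toy Series:

import polars as pl

s = pl.Series([3, 1, 3, 2, 1])
s.unique()  # order of the result is not guaranteed
s.unique(maintain_order=True)  # preserves first-seen order: 3, 1, 2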

Polars

Polars is a blazingly fast DataFrames library implemented in Rust, using Apache Arrow as its memory model.

  1. It is the best replacement for pandas for small data at this time.

  2. Polars supports multithreading and lazy computation (see the sketch after this list).

  3. Polars CANNOT handle data larger than memory at this time (though this might change in the future).
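
For example, a lazy query is only optimized and executed when collect is called. A minimal sketch (the file path and column names are hypothetical):

import polars as pl

lazy_query = (
    pl.scan_csv("data.csv")  # hypothetical file; nothing is read yet
    .filter(pl.col("x") > 0)
    .groupby("key")
    .agg(pl.col("x").sum())
)
df = lazy_query.collect()  # the optimized query runs here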

Comparison with pandas DataFrame

  1. Polars intentionally leaves out the concept of (row) index.

  2. There are no methods such as loc and iloc in Polars. Use df.get_column(name) / df[name] to access a single column, and df.get_columns() / df[[name1, name2]] to access multiple columns.

  3. Similar to a pandas DataFrame, chained access works but chained assignment does not. To assign a value to an element, use df[row_index, col_name] = val instead. Notice, however, that this is inefficient, as it updates the whole column under the hood. If you have to update the values of a column in a Polars DataFrame, do NOT loop through the cells one by one. Instead, create a Series containing the updated values and replace the column in a single operation (see the sketch after this list). For more discussion, refer to Efficient way to update a single cell of a Polars DataFrame?

  4. Polars provides pl.from_pandas and DataFrame.to_pandas to convert between pandas and Polars DataFrames.

  5. Polars' API for parsing CSV files is not as flexible as pandas'. Luckily, we can parse CSV files using pandas and then convert the resulting pandas DataFrames into Polars DataFrames.
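
As mentioned in item 3, build the updated values once and replace the column in one operation. A minimal sketch, assuming we want to double a column named x:

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Inefficient: each assignment rewrites the whole column under the hood.
# for i in range(df.height):
#     df[i, "x"] = df[i, "x"] * 2

# Efficient: compute the new values once and replace the column.
df = df.with_columns((pl.col("x") * 2).alias("x"))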

In [1]:
!pip3 install --user polars
Collecting polars
  Downloading polars-0.16.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 28.7 MB/s eta 0:00:00
Installing collected packages: polars
Successfully installed polars-0.16.2

In [2]:
import itertools as it
import polars as pl

Series

In [38]:
[m for m in dir(pl.Series) if not m.startswith("_")]
Out[38]:
['abs',
 'alias',
 'all',
 'any',
 'append',
 'apply',
 'arccos',
 'arccosh',
 'arcsin',
 'arcsinh',
 'arctan',
 'arctanh',
 'arg_max',
 'arg_min',
 'arg_sort',
 'arg_true',
 'arg_unique',
 'argsort',
 'arr',
 'bin',
 'cast',
 'cat',
 'ceil',
 'chunk_lengths',
 'cleared',
 'clip',
 'clip_max',
 'clip_min',
 'clone',
 'cos',
 'cosh',
 'cummax',
 'cummin',
 'cumprod',
 'cumsum',
 'cumulative_eval',
 'describe',
 'diff',
 'dot',
 'drop_nans',
 'drop_nulls',
 'dt',
 'dtype',
 'entropy',
 'estimated_size',
 'ewm_mean',
 'ewm_std',
 'ewm_var',
 'exp',
 'explode',
 'extend_constant',
 'fill_nan',
 'fill_null',
 'filter',
 'flags',
 'floor',
 'get_chunks',
 'has_validity',
 'hash',
 'head',
 'inner_dtype',
 'interpolate',
 'is_boolean',
 'is_datelike',
 'is_duplicated',
 'is_empty',
 'is_finite',
 'is_first',
 'is_float',
 'is_in',
 'is_infinite',
 'is_nan',
 'is_not_nan',
 'is_not_null',
 'is_null',
 'is_numeric',
 'is_sorted',
 'is_unique',
 'is_utf8',
 'item',
 'kurtosis',
 'len',
 'limit',
 'log',
 'log10',
 'max',
 'mean',
 'median',
 'min',
 'mode',
 'n_chunks',
 'n_unique',
 'name',
 'nan_max',
 'nan_min',
 'new_from_index',
 'null_count',
 'pct_change',
 'peak_max',
 'peak_min',
 'product',
 'quantile',
 'rank',
 'rechunk',
 'reinterpret',
 'rename',
 'reshape',
 'reverse',
 'rolling_apply',
 'rolling_max',
 'rolling_mean',
 'rolling_median',
 'rolling_min',
 'rolling_quantile',
 'rolling_skew',
 'rolling_std',
 'rolling_sum',
 'rolling_var',
 'round',
 'sample',
 'search_sorted',
 'series_equal',
 'set',
 'set_at_idx',
 'set_sorted',
 'shape',
 'shift',
 'shift_and_fill',
 'shrink_dtype',
 'shrink_to_fit',
 'shuffle',
 'sign',
 'sin',
 'sinh',
 'skew',
 'slice',
 'sort',
 'sqrt',
 'std',
 'str',
 'struct',
 'sum',
 'tail',
 'take',
 'take_every',
 'tan',
 'tanh',
 'time_unit',
 'to_arrow',
 'to_dummies',
 'to_frame',
 'to_list',
 'to_numpy',
 'to_pandas',
 'to_physical',
 'top_k',
 'unique',
 'unique_counts',
 'value_counts',
 'var',
 'view',
 'zip_with']
In [36]:
s = pl.Series([1, 2, 3])
s
Out[36]:
shape: (3,)
i64
1
2
3
In [37]:
s[0] = 100
s
Out[37]:
shape: (3,)
i64
100
2
3

DataFrame

In [25]:
[m for m in dir(pl.DataFrame) if not m.startswith("_")]
Out[25]:
['apply',
 'cleared',
 'clone',
 'columns',
 'describe',
 'drop',
 'drop_in_place',
 'drop_nulls',
 'dtypes',
 'estimated_size',
 'explode',
 'extend',
 'fill_nan',
 'fill_null',
 'filter',
 'find_idx_by_name',
 'fold',
 'frame_equal',
 'get_column',
 'get_columns',
 'glimpse',
 'groupby',
 'groupby_dynamic',
 'groupby_rolling',
 'hash_rows',
 'head',
 'height',
 'hstack',
 'insert_at_idx',
 'interpolate',
 'is_duplicated',
 'is_empty',
 'is_unique',
 'item',
 'iterrows',
 'join',
 'join_asof',
 'lazy',
 'limit',
 'max',
 'mean',
 'median',
 'melt',
 'merge_sorted',
 'min',
 'n_chunks',
 'n_unique',
 'null_count',
 'partition_by',
 'pearson_corr',
 'pipe',
 'pivot',
 'product',
 'quantile',
 'rechunk',
 'rename',
 'replace',
 'replace_at_idx',
 'reverse',
 'row',
 'rows',
 'sample',
 'schema',
 'select',
 'shape',
 'shift',
 'shift_and_fill',
 'shrink_to_fit',
 'slice',
 'sort',
 'std',
 'sum',
 'tail',
 'take_every',
 'to_arrow',
 'to_dict',
 'to_dicts',
 'to_dummies',
 'to_numpy',
 'to_pandas',
 'to_series',
 'to_struct',
 'transpose',
 'unique',
 'unnest',
 'unstack',
 'upsample',
 'var',
 'vstack',
 'width',
 'with_column',
 'with_columns',
 'with_row_count',
 'write_avro',
 'write_csv',
 'write_ipc',
 'write_json',
 'write_ndjson',
 'write_parquet']
In [4]:
df = pl.read_csv("https://j.mp/iriscsv")
df
Out[4]:
shape: (150, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
In [32]:
df["sepal_length"]
Out[32]:
shape: (150,)
sepal_length
f64
1000.0
4.9
4.7
4.6
5.0
5.4
4.6
5.0
4.4
4.9
5.4
4.8
...
6.0
6.9
6.7
6.9
5.8
6.8
6.7
6.7
6.3
6.5
6.2
5.9

Similar to pandas, chained assignment does NOT work! The chained assignment below silently has no effect: the first value of sepal_length stays 1000.0 (set by the earlier direct assignment) instead of becoming 10000.

In [35]:
df["sepal_length"][0] = 10000
df
Out[35]:
shape: (150, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
1000.0 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

You can slice by row and column at the same time.

In [28]:
df[0, "sepal_length"]
Out[28]:
5.1
In [65]:
dir(df)
Out[65]:
['__add__',
 '__annotations__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__truediv__',
 '__weakref__',
 '_accessors',
 '_comp',
 '_compare_to_non_df',
 '_compare_to_other_df',
 '_df',
 '_from_arrow',
 '_from_dict',
 '_from_dicts',
 '_from_numpy',
 '_from_pandas',
 '_from_pydf',
 '_from_records',
 '_ipython_key_completions_',
 '_pos_idx',
 '_pos_idxs',
 '_read_avro',
 '_read_csv',
 '_read_ipc',
 '_read_json',
 '_read_ndjson',
 '_read_parquet',
 '_repr_html_',
 'apply',
 'cleared',
 'clone',
 'columns',
 'describe',
 'drop',
 'drop_in_place',
 'drop_nulls',
 'dtypes',
 'estimated_size',
 'explode',
 'extend',
 'fill_nan',
 'fill_null',
 'filter',
 'find_idx_by_name',
 'fold',
 'frame_equal',
 'get_column',
 'get_columns',
 'glimpse',
 'groupby',
 'groupby_dynamic',
 'groupby_rolling',
 'hash_rows',
 'head',
 'height',
 'hstack',
 'insert_at_idx',
 'interpolate',
 'is_duplicated',
 'is_empty',
 'is_unique',
 'item',
 'iterrows',
 'join',
 'join_asof',
 'lazy',
 'limit',
 'max',
 'mean',
 'median',
 'melt',
 'merge_sorted',
 'min',
 'n_chunks',
 'n_unique',
 'null_count',
 'partition_by',
 'pearson_corr',
 'pipe',
 'pivot',
 'product',
 'quantile',
 'rechunk',
 'rename',
 'replace',
 'replace_at_idx',
 'reverse',
 'row',
 'rows',
 'sample',
 'schema',
 'select',
 'shape',
 'shift',
 'shift_and_fill',
 'shrink_to_fit',
 'slice',
 'sort',
 'std',
 'sum',
 'tail',
 'take_every',
 'to_arrow',
 'to_dict',
 'to_dicts',
 'to_dummies',
 'to_numpy',
 'to_pandas',
 'to_series',
 'to_struct',
 'transpose',
 'unique',
 'unnest',
 'unstack',
 'upsample',
 'var',
 'vstack',
 'width',
 'with_column',
 'with_columns',
 'with_row_count',
 'write_avro',
 'write_csv',
 'write_ipc',
 'write_json',
 'write_ndjson',
 'write_parquet']
In [29]:
df[0, "sepal_length"] = 1000
df
Out[29]:
shape: (150, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
1000.0 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
In [13]:
df.columns
Out[13]:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
In [16]:
s = df.get_column("sepal_length")
s
Out[16]:
shape: (150,)
sepal_length
f64
5.1
4.9
4.7
4.6
5.0
5.4
4.6
5.0
4.4
4.9
5.4
4.8
...
6.0
6.9
6.7
6.9
5.8
6.8
6.7
6.7
6.3
6.5
6.2
5.9
In [19]:
s[0] = 1000
s
Out[19]:
shape: (150,)
sepal_length
f64
1000.0
4.9
4.7
4.6
5.0
5.4
4.6
5.0
4.4
4.9
5.4
4.8
...
6.0
6.9
6.7
6.9
5.8
6.8
6.7
6.7
6.3
6.5
6.2
5.9
In [20]:
df
Out[20]:
shape: (150, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
As the outputs above and below show, mutating the Series returned by get_column does not modify the original DataFrame.

In [22]:
df.get_column("sepal_length")[0] = 2000
df
Out[22]:
shape: (150, 5)
sepal_length sepal_width petal_length petal_width species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"
In [17]:
type(s)
Out[17]:
polars.internals.series.series.Series

pl.all

pl.all() selects all columns as an expression; chaining .all() then reduces each boolean column to whether all of its values are true. (The DataFrame comp below, with boolean columns j0, j1, and ranks, was defined in a cell not shown here.)

In [22]:
comp.select(pl.all().all())
Out[22]:
shape: (1, 3)
j0 j1 ranks
bool bool bool
true true true
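
A self-contained sketch of the same pattern on a small boolean DataFrame:

import polars as pl

bools = pl.DataFrame({"j0": [True, True], "j1": [True, False]})
# pl.all() selects every column; the second .all() reduces each column
# to a single boolean indicating whether all of its values are true.
bools.select(pl.all().all())  # j0: true, j1: false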

DataFrame.frame_equal

Check whether a DataFrame equals another DataFrame, elementwise. (The df used below, with columns i0, i1, i2, j0, j1, and ranks, was defined in a cell not shown here.)

In [26]:
df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 13))[
    ["j0", "j1", "ranks"]
].frame_equal(
    df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 26))[
        ["j0", "j1", "ranks"]
    ]
)
Out[26]:
True
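
A self-contained sketch of frame_equal on small DataFrames:

import polars as pl

df1 = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df2 = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df3 = pl.DataFrame({"a": [1, 3], "b": ["x", "y"]})

df1.frame_equal(df2)  # True: same schema and values
df1.frame_equal(df3)  # False: values differ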
In [3]:
df = pl.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
    }
)
df
Out[3]:
shape: (5, 3)
id color shape
i64 str str
0 "red" "square"
1 "green" "triangle"
2 "green" "square"
3 "red" "triangle"
4 "red" "square"
The next cell filters the iris DataFrame loaded earlier (df was rebound between these cells when the notebook ran).

In [5]:
df.filter(pl.col("sepal_length") > 5).groupby("species").sum()
Out[5]:
shape: (3, 5)
species sepal_length sepal_width petal_length petal_width
str f64 f64 f64 f64
"versicolor" 281.9 131.8 202.9 63.3
"setosa" 116.9 81.7 33.2 6.1
"virginica" 324.5 146.2 273.1 99.6
In [7]:
df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df
Out[7]:
A fruits B cars
i64 str i64 str
1 "banana" 5 "beetle"
2 "banana" 4 "audi"
3 "apple" 3 "beetle"
4 "apple" 2 "beetle"
5 "banana" 1 "beetle"
In [8]:
df.sort("fruits").select(
    [
        "fruits",
        "cars",
        pl.lit("fruits").alias("literal_string_fruits"),
        pl.col("B").filter(pl.col("cars") == "beetle").sum(),
        pl.col("A")
        .filter(pl.col("B") > 2)
        .sum()
        .over("cars")
        .alias("sum_A_by_cars"),  # groups by "cars"
        pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),  # groups by "fruits"
        pl.col("A")
        .reverse()
        .over("fruits")
        .flatten()
        .alias("rev_A_by_fruits"),  # groups by "fruits
        pl.col("A")
        .sort_by("B")
        .over("fruits")
        .flatten()
        .alias("sort_A_by_B_by_fruits"),  # groups by "fruits"
    ]
)
Out[8]:
fruits cars literal_string_fruits B sum_A_by_cars sum_A_by_fruits rev_A_by_fruits sort_A_by_B_by_fruits
str str str i64 i64 i64 i64 i64
"apple" "beetle" "fruits" 11 4 7 4 4
"apple" "beetle" "fruits" 11 4 7 3 3
"banana" "beetle" "fruits" 11 4 8 5 5
"banana" "audi" "fruits" 11 2 8 2 2
"banana" "beetle" "fruits" 11 4 8 1 1
In [54]:
df.to_dict("records")
Out[54]:
{'id': shape: (5,)
 Series: 'id' [i64]
 [
 	0
 	1
 	2
 	3
 	4
 ],
 'color': shape: (5,)
 Series: 'color' [str]
 [
 	"red"
 	"green"
 	"green"
 	"red"
 	"red"
 ],
 'shape': shape: (5,)
 Series: 'shape' [str]
 [
 	"square"
 	"triangle"
 	"square"
 	"triangle"
 	"square"
 ]}
In [56]:
df.to_dicts()
Out[56]:
[{'id': 0, 'color': 'red', 'shape': 'square'},
 {'id': 1, 'color': 'green', 'shape': 'triangle'},
 {'id': 2, 'color': 'green', 'shape': 'square'},
 {'id': 3, 'color': 'red', 'shape': 'triangle'},
 {'id': 4, 'color': 'red', 'shape': 'square'}]
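
Note that Polars' DataFrame.to_dict does not take a pandas-style orient string; its parameter is as_series (a bool), so the truthy "records" argument above behaved like as_series=True and returned a dict of Series. A minimal sketch of the two intended forms:

df.to_dict(as_series=False)  # dict mapping column name -> list of values
df.to_dicts()  # list of row dicts, i.e., pandas-style "records"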
In [60]:
ss = df.to_struct("ss")
ss
Out[60]:
shape: (5,)
ss
struct[3]
{0,"red","square"}
{1,"green","triangle"}
{2,"green","square"}
{3,"red","triangle"}
{4,"red","square"}
In [62]:
type(ss[0])
Out[62]:
dict
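
To convert the struct Series back into a DataFrame, one option (a sketch built from the methods shown in the listings above) is to wrap the Series in a single-column frame and unnest the struct column:

ss.to_frame().unnest("ss")  # back to the original id/color/shape columns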

sort

DataFrame.sort is not in-place; it returns a new DataFrame (see the sketch after the docstring below).

In [63]:
?pl.DataFrame.sort
Signature:
pl.DataFrame.sort(
    self: 'DF',
    by: 'str | pli.Expr | Sequence[str] | Sequence[pli.Expr]',
    reverse: 'bool | list[bool]' = False,
    nulls_last: 'bool' = False,
) -> 'DF | DataFrame'
Docstring:
Sort the DataFrame by column.

Parameters
----------
by
    By which column to sort. Only accepts string.
reverse
    Reverse/descending sort.
nulls_last
    Place null values last. Can only be used if sorted by a single column.

Examples
--------
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sort("foo", reverse=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

**Sort by multiple columns.**
For multiple columns we can also use expression syntax.

>>> df.sort(
...     [pl.col("foo"), pl.col("bar") ** 2],
...     reverse=[True, False],
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘
File:      ~/.local/lib/python3.10/site-packages/polars/internals/dataframe/frame.py
Type:      function
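
Since sort returns a new DataFrame, reassign the result if you want to keep it. A minimal sketch:

df = pl.DataFrame({"foo": [2, 1, 3]})
df.sort("foo")  # returns a new, sorted DataFrame; df itself is unchanged
df = df.sort("foo")  # reassign to keep the sorted result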

to_pandas

In [4]:
df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)
dfp = df.to_pandas()
dfp
Out[4]:
foo bar ham
0 1 6 a
1 2 7 b
2 3 8 c
In [5]:
pl.from_pandas(dfp)
Out[5]:
shape: (3, 3)
foo bar ham
i64 i64 str
1 6 "a"
2 7 "b"
3 8 "c"
