
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

  1. polars.DataFrame.unique and polars.Series.unique do not maintain the original order by default. To maintain the original order, pass the option maintain_order=True.
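
For example (a minimal sketch; the values are made up):

import polars as pl

s = pl.Series([3, 1, 3, 2, 1])
s.unique()                     # order of the result is not guaranteed
s.unique(maintain_order=True)  # keeps first-occurrence order: [3, 1, 2]

DataFrame.unique accepts the same maintain_order flag.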

Polars

Polars is a blazingly fast DataFrames library implemented in Rust, using Apache Arrow as its memory model.

  1. It is the best replacement for pandas for small data at this time.

  2. Polars supports multithreading and lazy computation (see the sketch below).

  3. Polars CANNOT handle data larger than memory at this time (even though this might change in the future).
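
For instance, a lazy query only runs when .collect() is called (a minimal sketch; the file name and column names are made up):

import polars as pl

lf = pl.scan_csv("data.csv")  # lazy: nothing is read yet
res = (
    lf.filter(pl.col("x") > 0)
    .select(pl.col("y").sum())
    .collect()  # the whole plan executes here, using multiple threads
)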

Comparison with pandas DataFrame

  1. Polars intentionally leaves out the concept of (row) index.

  2. There are no methods such as loc and iloc in Polars. You can use df.get_column(name) / df[name] to access a single column, and df.select([col1, col2]) / df[[col1, col2]] to access multiple columns (df.get_columns() returns all columns as a list of Series).

  3. Similar to pandas DataFrames, chained access works but chained assignment does not. To assign a value to an element, use df[row_index, col_name] = val instead. Notice, however, that this is inefficient, as it updates the whole column under the hood. If you have to update the values of a column in a Polars DataFrame, do NOT loop through the cells to update them one by one. Instead, create a Series containing the updated values and replace the column in a single step (see the sketch after this list). For more discussion, please refer to Efficient way to update a single cell of a Polars DataFrame? .

  4. Polars provides pl.from_pandas and DataFrame.to_pandas to convert between pandas and Polars DataFrames.

  5. Polars’ APIs for parsing CSV files are not as flexible as pandas’. Luckily, we can parse CSV files using pandas and then convert the resulting pandas DataFrames into Polars DataFrames, as sketched below.
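
A minimal sketch of points 3-5 (the column names and values are made up; the pandas round-trip assumes pandas and pyarrow are installed):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Replace a whole column at once instead of assigning cell by cell.
df = df.with_columns((pl.col("x") * 10).alias("x"))

# Round-trip through pandas, e.g., after parsing a tricky CSV file with pandas.
pdf = df.to_pandas()
df2 = pl.from_pandas(pdf)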

!pip3 install --user polars
Collecting polars
  Downloading polars-0.16.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 28.7 MB/s eta 0:00:0000:0100:01
Installing collected packages: polars
Successfully installed polars-0.16.2

[notice] A new release of pip available: 22.3.1 -> 23.0
[notice] To update, run: python3 -m pip install --upgrade pip
import itertools as it
import polars as pl

Series

[m for m in dir(pl.Series) if not m.startswith("_")]
['abs', 'alias', 'all', 'any', 'append', 'apply', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg_max', 'arg_min', 'arg_sort', 'arg_true', 'arg_unique', 'argsort', 'arr', 'bin', 'cast', 'cat', 'ceil', 'chunk_lengths', 'cleared', 'clip', 'clip_max', 'clip_min', 'clone', 'cos', 'cosh', 'cummax', 'cummin', 'cumprod', 'cumsum', 'cumulative_eval', 'describe', 'diff', 'dot', 'drop_nans', 'drop_nulls', 'dt', 'dtype', 'entropy', 'estimated_size', 'ewm_mean', 'ewm_std', 'ewm_var', 'exp', 'explode', 'extend_constant', 'fill_nan', 'fill_null', 'filter', 'flags', 'floor', 'get_chunks', 'has_validity', 'hash', 'head', 'inner_dtype', 'interpolate', 'is_boolean', 'is_datelike', 'is_duplicated', 'is_empty', 'is_finite', 'is_first', 'is_float', 'is_in', 'is_infinite', 'is_nan', 'is_not_nan', 'is_not_null', 'is_null', 'is_numeric', 'is_sorted', 'is_unique', 'is_utf8', 'item', 'kurtosis', 'len', 'limit', 'log', 'log10', 'max', 'mean', 'median', 'min', 'mode', 'n_chunks', 'n_unique', 'name', 'nan_max', 'nan_min', 'new_from_index', 'null_count', 'pct_change', 'peak_max', 'peak_min', 'product', 'quantile', 'rank', 'rechunk', 'reinterpret', 'rename', 'reshape', 'reverse', 'rolling_apply', 'rolling_max', 'rolling_mean', 'rolling_median', 'rolling_min', 'rolling_quantile', 'rolling_skew', 'rolling_std', 'rolling_sum', 'rolling_var', 'round', 'sample', 'search_sorted', 'series_equal', 'set', 'set_at_idx', 'set_sorted', 'shape', 'shift', 'shift_and_fill', 'shrink_dtype', 'shrink_to_fit', 'shuffle', 'sign', 'sin', 'sinh', 'skew', 'slice', 'sort', 'sqrt', 'std', 'str', 'struct', 'sum', 'tail', 'take', 'take_every', 'tan', 'tanh', 'time_unit', 'to_arrow', 'to_dummies', 'to_frame', 'to_list', 'to_numpy', 'to_pandas', 'to_physical', 'top_k', 'unique', 'unique_counts', 'value_counts', 'var', 'view', 'zip_with']
s = pl.Series([1, 2, 3])
s
s[0] = 100
s
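
A Series can also be given a name and an explicit dtype at construction time (a minimal sketch; the name and dtype are arbitrary):

s2 = pl.Series("x", [1.0, 2.0, 3.0], dtype=pl.Float32)
s2.name, s2.dtype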

DataFrame

[m for m in dir(pl.DataFrame) if not m.startswith("_")]
['apply', 'cleared', 'clone', 'columns', 'describe', 'drop', 'drop_in_place', 'drop_nulls', 'dtypes', 'estimated_size', 'explode', 'extend', 'fill_nan', 'fill_null', 'filter', 'find_idx_by_name', 'fold', 'frame_equal', 'get_column', 'get_columns', 'glimpse', 'groupby', 'groupby_dynamic', 'groupby_rolling', 'hash_rows', 'head', 'height', 'hstack', 'insert_at_idx', 'interpolate', 'is_duplicated', 'is_empty', 'is_unique', 'item', 'iterrows', 'join', 'join_asof', 'lazy', 'limit', 'max', 'mean', 'median', 'melt', 'merge_sorted', 'min', 'n_chunks', 'n_unique', 'null_count', 'partition_by', 'pearson_corr', 'pipe', 'pivot', 'product', 'quantile', 'rechunk', 'rename', 'replace', 'replace_at_idx', 'reverse', 'row', 'rows', 'sample', 'schema', 'select', 'shape', 'shift', 'shift_and_fill', 'shrink_to_fit', 'slice', 'sort', 'std', 'sum', 'tail', 'take_every', 'to_arrow', 'to_dict', 'to_dicts', 'to_dummies', 'to_numpy', 'to_pandas', 'to_series', 'to_struct', 'transpose', 'unique', 'unnest', 'unstack', 'upsample', 'var', 'vstack', 'width', 'with_column', 'with_columns', 'with_row_count', 'write_avro', 'write_csv', 'write_ipc', 'write_json', 'write_ndjson', 'write_parquet']
df = pl.read_csv("https://j.mp/iriscsv")
df
df["sepal_length"]
df["sepal_length"][0] = 10000
df
Loading...

You can index a DataFrame by row and column at the same time.

df[0, "sepal_length"]
5.1
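
The row and column indexers can also be slices or lists (a sketch; exactly which forms __getitem__ accepts may vary across Polars versions):

df[0:3, ["sepal_length", "species"]]  # first three rows of two columns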
dir(df)
['__add__', '__annotations__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '_accessors', '_comp', '_compare_to_non_df', '_compare_to_other_df', '_df', '_from_arrow', '_from_dict', '_from_dicts', '_from_numpy', '_from_pandas', '_from_pydf', '_from_records', '_ipython_key_completions_', '_pos_idx', '_pos_idxs', '_read_avro', '_read_csv', '_read_ipc', '_read_json', '_read_ndjson', '_read_parquet', '_repr_html_', 'apply', 'cleared', 'clone', 'columns', 'describe', 'drop', 'drop_in_place', 'drop_nulls', 'dtypes', 'estimated_size', 'explode', 'extend', 'fill_nan', 'fill_null', 'filter', 'find_idx_by_name', 'fold', 'frame_equal', 'get_column', 'get_columns', 'glimpse', 'groupby', 'groupby_dynamic', 'groupby_rolling', 'hash_rows', 'head', 'height', 'hstack', 'insert_at_idx', 'interpolate', 'is_duplicated', 'is_empty', 'is_unique', 'item', 'iterrows', 'join', 'join_asof', 'lazy', 'limit', 'max', 'mean', 'median', 'melt', 'merge_sorted', 'min', 'n_chunks', 'n_unique', 'null_count', 'partition_by', 'pearson_corr', 'pipe', 'pivot', 'product', 'quantile', 'rechunk', 'rename', 'replace', 'replace_at_idx', 'reverse', 'row', 'rows', 'sample', 'schema', 'select', 'shape', 'shift', 'shift_and_fill', 'shrink_to_fit', 'slice', 'sort', 'std', 'sum', 'tail', 'take_every', 'to_arrow', 'to_dict', 'to_dicts', 'to_dummies', 'to_numpy', 'to_pandas', 'to_series', 'to_struct', 'transpose', 'unique', 'unnest', 'unstack', 'upsample', 'var', 'vstack', 'width', 'with_column', 'with_columns', 'with_row_count', 'write_avro', 'write_csv', 'write_ipc', 'write_json', 'write_ndjson', 'write_parquet']
df[0, "sepal_length"] = 1000
df
df.columns
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
s = df.get_column("sepal_length")
s
s[0] = 1000
s
df
df.get_column("sepal_length")[0] = 2000
df
type(s)
polars.internals.series.series.Series

Filter rows and then aggregate per group:

df.filter(pl.col("sepal_length") > 5).groupby("species").sum()

pl.all

pl.all() is the wildcard expression selecting every column; calling .all() on it then reduces each column to whether all of its values are true. The comp DataFrame used here is not defined in this notebook, so the cell below recreates a minimal stand-in (the column names are made up):

comp = pl.DataFrame({"x": [True, True], "y": [True, False]})
comp.select(pl.all().all())

DataFrame.frame_equal

Check whether one DataFrame equals another, comparing all elements. (The df queried below, with columns i0, i1, i2, j0, j1, and ranks, comes from a different dataset than the iris data above; a self-contained illustration follows.)

df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 13))[
    ["j0", "j1", "ranks"]
].frame_equal(
    df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 26))[
        ["j0", "j1", "ranks"]
    ]
)
True
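
A minimal self-contained illustration (the column name is made up):

df1 = pl.DataFrame({"x": [1, 2, 3]})
df2 = pl.DataFrame({"x": [1, 2, 3]})
df1.frame_equal(df2)  # True
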
df = pl.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
    }
)
df

Note that DataFrame.to_dict does not accept pandas-style orient strings: the "records" argument below is silently treated as a truthy as_series flag, so a dict of Series is returned. Use DataFrame.to_dicts to get a list of row dicts instead.

df.to_dict("records")
{'id': shape: (5,) Series: 'id' [i64] [ 0 1 2 3 4 ], 'color': shape: (5,) Series: 'color' [str] [ "red" "green" "green" "red" "red" ], 'shape': shape: (5,) Series: 'shape' [str] [ "square" "triangle" "square" "triangle" "square" ]}
df.to_dicts()
[{'id': 0, 'color': 'red', 'shape': 'square'}, {'id': 1, 'color': 'green', 'shape': 'triangle'}, {'id': 2, 'color': 'green', 'shape': 'square'}, {'id': 3, 'color': 'red', 'shape': 'triangle'}, {'id': 4, 'color': 'red', 'shape': 'square'}]
df.filter(pl.col("sepal_length") > 5).groupby("species").sum()
Loading...
df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df
The following select combines plain columns, literals, filtered aggregations, and window functions; Expr.over(...) computes an aggregate within each group and maps it back to the rows:
df.sort("fruits").select(
    [
        "fruits",
        "cars",
        pl.lit("fruits").alias("literal_string_fruits"),
        pl.col("B").filter(pl.col("cars") == "beetle").sum(),
        pl.col("A")
        .filter(pl.col("B") > 2)
        .sum()
        .over("cars")
        .alias("sum_A_by_cars"),  # groups by "cars"
        pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),  # groups by "fruits"
        pl.col("A")
        .reverse()
        .over("fruits")
        .flatten()
        .alias("rev_A_by_fruits"),  # groups by "fruits
        pl.col("A")
        .sort_by("B")
        .over("fruits")
        .flatten()
        .alias("sort_A_by_B_by_fruits"),  # groups by "fruits"
    ]
)
df.to_dict("records")
{'id': shape: (5,) Series: 'id' [i64] [ 0 1 2 3 4 ], 'color': shape: (5,) Series: 'color' [str] [ "red" "green" "green" "red" "red" ], 'shape': shape: (5,) Series: 'shape' [str] [ "square" "triangle" "square" "triangle" "square" ]}
df.to_dicts()
[{'id': 0, 'color': 'red', 'shape': 'square'}, {'id': 1, 'color': 'green', 'shape': 'triangle'}, {'id': 2, 'color': 'green', 'shape': 'square'}, {'id': 3, 'color': 'red', 'shape': 'triangle'}, {'id': 4, 'color': 'red', 'shape': 'square'}]
ss = df.to_struct("ss")
ss
type(ss[0])
dict
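
Individual fields can be pulled back out of a struct Series through the struct namespace (a sketch; assumes Series.struct.field is available in this version):

ss.struct.field("A")  # the original "A" column as a Series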

sort

DataFrame.sort is not in-place. It returns a new DataFrame.

?pl.DataFrame.sort
Signature:
pl.DataFrame.sort(
    self: 'DF',
    by: 'str | pli.Expr | Sequence[str] | Sequence[pli.Expr]',
    reverse: 'bool | list[bool]' = False,
    nulls_last: 'bool' = False,
) -> 'DF | DataFrame'
Docstring:
Sort the DataFrame by column.

Parameters
----------
by
    By which column to sort. Only accepts string.
reverse
    Reverse/descending sort.
nulls_last
    Place null values last. Can only be used if sorted by a single column.

Examples
--------
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sort("foo", reverse=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

Sort by multiple columns. For multiple columns we can also use expression syntax.

>>> df.sort(
...     [pl.col("foo"), pl.col("bar") ** 2],
...     reverse=[True, False],
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

File: ~/.local/lib/python3.10/site-packages/polars/internals/dataframe/frame.py
Type: function
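
Since sort returns a new DataFrame, remember to keep the result (a minimal sketch):

tmp = pl.DataFrame({"foo": [2, 1, 3]})
tmp_sorted = tmp.sort("foo")  # tmp itself is left unchanged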

to_pandas

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)
dfp = df.to_pandas()
dfp
pl.from_pandas(dfp)
Note that conversion to and from pandas requires pandas (and pyarrow) to be installed alongside Polars.