
Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Tips and Traps

  1. polars.DataFrame.unique and polars.Series.unique do not maintain the original order by default. To maintain the original order, pass the option maintain_order=True.
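
For example (a minimal sketch; the values are made up):

import polars as pl

s = pl.Series([3, 1, 3, 2, 1])
s.unique()                     # order of the result is not guaranteed
s.unique(maintain_order=True)  # keeps first-occurrence order: [3, 1, 2]

DataFrame.unique accepts the same maintain_order flag.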

Polars

Polars is a blazingly fast DataFrames library implemented in Rust, using Apache Arrow as its memory model.

  1. It is the best replacement for pandas for small data at this time.

  2. Polars supports multithreading and lazy computation (see the sketch below).

  3. Polars CANNOT handle data larger than memory at this time (even though this might change in the future).
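
For instance, a lazy query only runs when .collect() is called (a minimal sketch; the file name and column names are made up):

import polars as pl

lf = pl.scan_csv("data.csv")  # lazy: nothing is read yet
res = (
    lf.filter(pl.col("x") > 0)
    .select(pl.col("y").sum())
    .collect()  # the whole plan executes here, using multiple threads
)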

Comparison with pandas DataFrame

  1. Polars intentionally leaves out the concept of (row) index.

  2. There are no methods such as loc and iloc in Polars. You can use df.get_column(name) / df[name] to access a single column, and df.select([col1, col2]) / df[[col1, col2]] to access multiple columns (df.get_columns() returns all columns as a list of Series).

  3. Similar to pandas DataFrames, chained access works but chained assignment does not. To assign a value to an element, use df[row_index, col_name] = val instead. Notice, however, that this is inefficient, as it updates the whole column under the hood. If you have to update the values of a column in a Polars DataFrame, do NOT loop through the cells to update them one by one. Instead, create a Series containing the updated values and replace the column in a single step (see the sketch after this list). For more discussion, please refer to Efficient way to update a single cell of a Polars DataFrame? .

  4. Polars provides pl.from_pandas and DataFrame.to_pandas to convert between pandas and Polars DataFrames.

  5. Polars’ APIs for parsing CSV files are not as flexible as pandas’. Luckily, we can parse CSV files using pandas and then convert the resulting pandas DataFrames into Polars DataFrames, as sketched below.
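
A minimal sketch of points 3-5 (the column names and values are made up; the pandas round-trip assumes pandas and pyarrow are installed):

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Replace a whole column at once instead of assigning cell by cell.
df = df.with_columns((pl.col("x") * 10).alias("x"))

# Round-trip through pandas, e.g., after parsing a tricky CSV file with pandas.
pdf = df.to_pandas()
df2 = pl.from_pandas(pdf)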

!pip3 install --user polars
Collecting polars
  Downloading polars-0.16.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (15.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.2/15.2 MB 28.7 MB/s eta 0:00:0000:0100:01
Installing collected packages: polars
Successfully installed polars-0.16.2

[notice] A new release of pip available: 22.3.1 -> 23.0
[notice] To update, run: python3 -m pip install --upgrade pip
import itertools as it
import polars as pl

Series

[m for m in dir(pl.Series) if not m.startswith("_")]
['abs', 'alias', 'all', 'any', 'append', 'apply', 'arccos', 'arccosh', 'arcsin', 'arcsinh', 'arctan', 'arctanh', 'arg_max', 'arg_min', 'arg_sort', 'arg_true', 'arg_unique', 'argsort', 'arr', 'bin', 'cast', 'cat', 'ceil', 'chunk_lengths', 'cleared', 'clip', 'clip_max', 'clip_min', 'clone', 'cos', 'cosh', 'cummax', 'cummin', 'cumprod', 'cumsum', 'cumulative_eval', 'describe', 'diff', 'dot', 'drop_nans', 'drop_nulls', 'dt', 'dtype', 'entropy', 'estimated_size', 'ewm_mean', 'ewm_std', 'ewm_var', 'exp', 'explode', 'extend_constant', 'fill_nan', 'fill_null', 'filter', 'flags', 'floor', 'get_chunks', 'has_validity', 'hash', 'head', 'inner_dtype', 'interpolate', 'is_boolean', 'is_datelike', 'is_duplicated', 'is_empty', 'is_finite', 'is_first', 'is_float', 'is_in', 'is_infinite', 'is_nan', 'is_not_nan', 'is_not_null', 'is_null', 'is_numeric', 'is_sorted', 'is_unique', 'is_utf8', 'item', 'kurtosis', 'len', 'limit', 'log', 'log10', 'max', 'mean', 'median', 'min', 'mode', 'n_chunks', 'n_unique', 'name', 'nan_max', 'nan_min', 'new_from_index', 'null_count', 'pct_change', 'peak_max', 'peak_min', 'product', 'quantile', 'rank', 'rechunk', 'reinterpret', 'rename', 'reshape', 'reverse', 'rolling_apply', 'rolling_max', 'rolling_mean', 'rolling_median', 'rolling_min', 'rolling_quantile', 'rolling_skew', 'rolling_std', 'rolling_sum', 'rolling_var', 'round', 'sample', 'search_sorted', 'series_equal', 'set', 'set_at_idx', 'set_sorted', 'shape', 'shift', 'shift_and_fill', 'shrink_dtype', 'shrink_to_fit', 'shuffle', 'sign', 'sin', 'sinh', 'skew', 'slice', 'sort', 'sqrt', 'std', 'str', 'struct', 'sum', 'tail', 'take', 'take_every', 'tan', 'tanh', 'time_unit', 'to_arrow', 'to_dummies', 'to_frame', 'to_list', 'to_numpy', 'to_pandas', 'to_physical', 'top_k', 'unique', 'unique_counts', 'value_counts', 'var', 'view', 'zip_with']
s = pl.Series([1, 2, 3])
s
s[0] = 100
s
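
A Series can also be given a name and an explicit dtype at construction time (a minimal sketch; the name and dtype are arbitrary):

s2 = pl.Series("x", [1.0, 2.0, 3.0], dtype=pl.Float32)
s2.name, s2.dtype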

DataFrame

[m for m in dir(pl.DataFrame) if not m.startswith("_")]
['apply', 'cleared', 'clone', 'columns', 'describe', 'drop', 'drop_in_place', 'drop_nulls', 'dtypes', 'estimated_size', 'explode', 'extend', 'fill_nan', 'fill_null', 'filter', 'find_idx_by_name', 'fold', 'frame_equal', 'get_column', 'get_columns', 'glimpse', 'groupby', 'groupby_dynamic', 'groupby_rolling', 'hash_rows', 'head', 'height', 'hstack', 'insert_at_idx', 'interpolate', 'is_duplicated', 'is_empty', 'is_unique', 'item', 'iterrows', 'join', 'join_asof', 'lazy', 'limit', 'max', 'mean', 'median', 'melt', 'merge_sorted', 'min', 'n_chunks', 'n_unique', 'null_count', 'partition_by', 'pearson_corr', 'pipe', 'pivot', 'product', 'quantile', 'rechunk', 'rename', 'replace', 'replace_at_idx', 'reverse', 'row', 'rows', 'sample', 'schema', 'select', 'shape', 'shift', 'shift_and_fill', 'shrink_to_fit', 'slice', 'sort', 'std', 'sum', 'tail', 'take_every', 'to_arrow', 'to_dict', 'to_dicts', 'to_dummies', 'to_numpy', 'to_pandas', 'to_series', 'to_struct', 'transpose', 'unique', 'unnest', 'unstack', 'upsample', 'var', 'vstack', 'width', 'with_column', 'with_columns', 'with_row_count', 'write_avro', 'write_csv', 'write_ipc', 'write_json', 'write_ndjson', 'write_parquet']
df = pl.read_csv("https://j.mp/iriscsv")
df
df["sepal_length"]
df["sepal_length"][0] = 10000
df
Loading...

You can index a DataFrame by row and column at the same time.

df[0, "sepal_length"]
5.1
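
The row and column indexers can also be slices or lists (a sketch; exactly which forms __getitem__ accepts may vary across Polars versions):

df[0:3, ["sepal_length", "species"]]  # first three rows of two columns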
dir(df)
['__add__', '__annotations__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__module__', '__mul__', '__ne__', '__new__', '__radd__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__weakref__', '_accessors', '_comp', '_compare_to_non_df', '_compare_to_other_df', '_df', '_from_arrow', '_from_dict', '_from_dicts', '_from_numpy', '_from_pandas', '_from_pydf', '_from_records', '_ipython_key_completions_', '_pos_idx', '_pos_idxs', '_read_avro', '_read_csv', '_read_ipc', '_read_json', '_read_ndjson', '_read_parquet', '_repr_html_', 'apply', 'cleared', 'clone', 'columns', 'describe', 'drop', 'drop_in_place', 'drop_nulls', 'dtypes', 'estimated_size', 'explode', 'extend', 'fill_nan', 'fill_null', 'filter', 'find_idx_by_name', 'fold', 'frame_equal', 'get_column', 'get_columns', 'glimpse', 'groupby', 'groupby_dynamic', 'groupby_rolling', 'hash_rows', 'head', 'height', 'hstack', 'insert_at_idx', 'interpolate', 'is_duplicated', 'is_empty', 'is_unique', 'item', 'iterrows', 'join', 'join_asof', 'lazy', 'limit', 'max', 'mean', 'median', 'melt', 'merge_sorted', 'min', 'n_chunks', 'n_unique', 'null_count', 'partition_by', 'pearson_corr', 'pipe', 'pivot', 'product', 'quantile', 'rechunk', 'rename', 'replace', 'replace_at_idx', 'reverse', 'row', 'rows', 'sample', 'schema', 'select', 'shape', 'shift', 'shift_and_fill', 'shrink_to_fit', 'slice', 'sort', 'std', 'sum', 'tail', 'take_every', 'to_arrow', 'to_dict', 'to_dicts', 'to_dummies', 'to_numpy', 'to_pandas', 'to_series', 'to_struct', 'transpose', 'unique', 'unnest', 'unstack', 'upsample', 'var', 'vstack', 'width', 'with_column', 'with_columns', 'with_row_count', 'write_avro', 'write_csv', 'write_ipc', 'write_json', 'write_ndjson', 'write_parquet']
df[0, "sepal_length"] = 1000
df
df.columns
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
s = df.get_column("sepal_length")
s
s[0] = 1000
s
df
df.get_column("sepal_length")[0] = 2000
df
type(s)
polars.internals.series.series.Series

Filter rows and then aggregate per group:

df.filter(pl.col("sepal_length") > 5).groupby("species").sum()

pl.all

pl.all() is the wildcard expression selecting every column; calling .all() on it then reduces each column to whether all of its values are true. The comp DataFrame used here is not defined in this notebook, so the cell below recreates a minimal stand-in (the column names are made up):

comp = pl.DataFrame({"x": [True, True], "y": [True, False]})
comp.select(pl.all().all())

DataFrame.frame_equal

Check whether one DataFrame equals another, comparing all elements. (The df queried below, with columns i0, i1, i2, j0, j1, and ranks, comes from a different dataset than the iris data above; a self-contained illustration follows.)

df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 13))[
    ["j0", "j1", "ranks"]
].frame_equal(
    df.filter((df["i0"] == 1) & (df["i1"] == 2) & (df["i2"] == 26))[
        ["j0", "j1", "ranks"]
    ]
)
True
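
A minimal self-contained illustration (the column name is made up):

df1 = pl.DataFrame({"x": [1, 2, 3]})
df2 = pl.DataFrame({"x": [1, 2, 3]})
df1.frame_equal(df2)  # True
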
df = pl.DataFrame(
    {
        "id": [0, 1, 2, 3, 4],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
    }
)
df

Note that DataFrame.to_dict does not accept pandas-style orient strings: the "records" argument below is silently treated as a truthy as_series flag, so a dict of Series is returned. Use DataFrame.to_dicts to get a list of row dicts instead.

df.to_dict("records")
{'id': shape: (5,) Series: 'id' [i64] [ 0 1 2 3 4 ], 'color': shape: (5,) Series: 'color' [str] [ "red" "green" "green" "red" "red" ], 'shape': shape: (5,) Series: 'shape' [str] [ "square" "triangle" "square" "triangle" "square" ]}
df.to_dicts()
[{'id': 0, 'color': 'red', 'shape': 'square'}, {'id': 1, 'color': 'green', 'shape': 'triangle'}, {'id': 2, 'color': 'green', 'shape': 'square'}, {'id': 3, 'color': 'red', 'shape': 'triangle'}, {'id': 4, 'color': 'red', 'shape': 'square'}]
df.filter(pl.col("sepal_length") > 5).groupby("species").sum()
Loading...
df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df
The following select combines plain columns, literals, filtered aggregations, and window functions; Expr.over(...) computes an aggregate within each group and maps it back to the rows:
df.sort("fruits").select(
    [
        "fruits",
        "cars",
        pl.lit("fruits").alias("literal_string_fruits"),
        pl.col("B").filter(pl.col("cars") == "beetle").sum(),
        pl.col("A")
        .filter(pl.col("B") > 2)
        .sum()
        .over("cars")
        .alias("sum_A_by_cars"),  # groups by "cars"
        pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),  # groups by "fruits"
        pl.col("A")
        .reverse()
        .over("fruits")
        .flatten()
        .alias("rev_A_by_fruits"),  # groups by "fruits
        pl.col("A")
        .sort_by("B")
        .over("fruits")
        .flatten()
        .alias("sort_A_by_B_by_fruits"),  # groups by "fruits"
    ]
)
df.to_dict("records")
{'id': shape: (5,) Series: 'id' [i64] [ 0 1 2 3 4 ], 'color': shape: (5,) Series: 'color' [str] [ "red" "green" "green" "red" "red" ], 'shape': shape: (5,) Series: 'shape' [str] [ "square" "triangle" "square" "triangle" "square" ]}
df.to_dicts()
[{'id': 0, 'color': 'red', 'shape': 'square'}, {'id': 1, 'color': 'green', 'shape': 'triangle'}, {'id': 2, 'color': 'green', 'shape': 'square'}, {'id': 3, 'color': 'red', 'shape': 'triangle'}, {'id': 4, 'color': 'red', 'shape': 'square'}]
ss = df.to_struct("ss")
ss
type(ss[0])
dict
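
Individual fields can be pulled back out of a struct Series through the struct namespace (a sketch; assumes Series.struct.field is available in this version):

ss.struct.field("A")  # the original "A" column as a Series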

sort

DataFrame.sort is not in-place. It returns a new DataFrame.

?pl.DataFrame.sort
Signature:
pl.DataFrame.sort(
    self: 'DF',
    by: 'str | pli.Expr | Sequence[str] | Sequence[pli.Expr]',
    reverse: 'bool | list[bool]' = False,
    nulls_last: 'bool' = False,
) -> 'DF | DataFrame'
Docstring:
Sort the DataFrame by column.

Parameters
----------
by
    By which column to sort. Only accepts string.
reverse
    Reverse/descending sort.
nulls_last
    Place null values last. Can only be used if sorted by a single column.

Examples
--------
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sort("foo", reverse=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

Sort by multiple columns. For multiple columns we can also use expression syntax.

>>> df.sort(
...     [pl.col("foo"), pl.col("bar") ** 2],
...     reverse=[True, False],
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

File: ~/.local/lib/python3.10/site-packages/polars/internals/dataframe/frame.py
Type: function
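
Since sort returns a new DataFrame, remember to keep the result (a minimal sketch):

tmp = pl.DataFrame({"foo": [2, 1, 3]})
tmp_sorted = tmp.sort("foo")  # tmp itself is left unchanged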

to_pandas

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)
dfp = df.to_pandas()
dfp
pl.from_pandas(dfp)
Note that conversion to and from pandas requires pandas (and pyarrow) to be installed alongside Polars.