Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Comment

  1. The order of elements within each group are preserved (as the original order).

  2. groupby works exactly the same on index if the index is named.

  3. The order of columns in groupby matters if you want unstack the results later.

  4. groupby works on columns too and it can group by some level of a MultiIndex.

groupby on a Column of An Empty Data Frame

import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})

df
Loading...
df.groupby('x')[['y', 'z']].sum()
Loading...

groupby on the Index of An Empty Data Frame

import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})
df.set_index('x')
df
Loading...
df.groupby('x')[['y', 'z']].sum()
Loading...

groupby on Non-empty Data Frames

import pandas as pd

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df
Loading...
df.groupby('x')[['y', 'z']].sum()
Loading...
df.groupby('x').sum()
Loading...
df.groupby('x')[['y', 'z']].sum()
Loading...
df.groupby(['x'], sort=False).sum()
Loading...

Aggregation Function Taking Extra Parameters

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df
Loading...
def my_min(x, offset=0):
    return 0 + min(x)
df.groupby('x')[['y', 'z']].agg(my_min, offset=1000)
Loading...
df.groupby('x')[['y', 'z']].agg(min)
Loading...
df.groupby(['x'], sort=False).apply(lambda x: x)
Loading...

agg

Notice that most aggregation functions just ignore NaN!!!

min on each column inside each group.

df.groupby('x').agg('min')
Loading...

Multiple aggregations for each column.

df.groupby('x').agg(['min', 'max'])
Loading...

Aggregate on the column y only.

df.groupby('x').y.agg(['min', 'max'])
Loading...

Group by Multiple Criterias

When grouping by multiple criterias, you can mix labels and series together.

df.groupby(['x', 'y']).sum()
Loading...
df.groupby(['x', df.y]).sum()
Loading...
df.groupby([df.x, df.y]).sum()
Loading...

Naming Aggreated Columns

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df
Loading...
df.groupby('x').agg(
    y_avg=('y', np.average),
    y_sum=('y', sum),
    x_sum=('x', sum),
)
Loading...

You CANNOT use multiple lambda functions in the aggregate method as of pandas 0.25.3. A patch has been made but not released yet. Before the fix is released, you just need to define lambda functions as regular named functions to avoid the issue.

df.groupby('x').agg(
    y_avg=('y', lambda x: np.average(x)),
    y_sum=('y', lambda x: sum(x)),
    x_sum=('x', sum),
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-859b6ed6abd5> in <module>
      2     y_avg=('y', lambda x: np.average(x)),
      3     y_sum=('y', lambda x: sum(x)),
----> 4     x_sum=('x', sum),
      5 )

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1453     @Appender(_shared_docs["aggregate"])
   1454     def aggregate(self, arg=None, *args, **kwargs):
-> 1455         return super().aggregate(arg, *args, **kwargs)
   1456 
   1457     agg = aggregate

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    262 
    263         if relabeling:
--> 264             result = result[order]
    265             result.columns = columns
    266 

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2999             if is_iterator(key):
   3000                 key = list(key)
-> 3001             indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
   3002 
   3003         # take() does not accept boolean indexers

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
   1283                 # When setting, missing keys are not allowed, even with .loc:
   1284                 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1285                 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
   1286         else:
   1287             try:

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1090 
   1091         self._validate_read_indexer(
-> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1093         )
   1094         return keyarr, indexer

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1183             if not (self.name == "loc" and not raise_missing):
   1184                 not_found = list(set(key) - set(ax))
-> 1185                 raise KeyError("{} not in index".format(not_found))
   1186 
   1187             # we skip the warning on Categorical/Interval

KeyError: "[('y', '<lambda>')] not in index"

By default, the groupby column is used as the index.

r = df.groupby('x').agg({'y': 'max', 'z': ['max', 'min', 'mean', 'count']})
r
Loading...

You can have the groupby column as an column in final results using the option as_index=False.

r = df.groupby('x',
               as_index=False).agg({
                   'y': 'max',
                   'z': ['max', 'min', 'mean', 'count']
               })
r
Loading...
r.columns
MultiIndex(levels=[['y', 'z', 'x'], ['count', 'max', 'mean', 'min', '']], labels=[[2, 0, 1, 1, 1, 1], [4, 1, 1, 3, 2, 0]])
r.columns = ['x', 'ymax', 'zmax', 'zmin', 'zmean', 'zcnt']
r
Loading...

Equivalent of Having

df.groupby(‘col’).filter

2**0.5
1.4142135623730951
pow(2, 0.5)
1.4142135623730951
s = pd.Series([1, 2, 3])
s
0 1 1 2 2 3 dtype: int64
s['abc'] = 1000
s
0 1 1 2 2 3 abc 1000 dtype: int64

Aggregation Using apply

df.apply(np.average, args=(None, df.z))
x 3.761905 y 2.666667 z 4.333333 dtype: float64
df.drop('z', axis=1)
Loading...
df.apply(lambda col: np.average(col, weights=df.z))
x 3.761905 y 2.666667 z 4.333333 dtype: float64
np.average(df.x, weights=df.z)
3.761904761904762
np.average(df.y, weights=df.z)
2.6666666666666665
def my_sum(df):
    w = df.z / df.z.sum()
    return df.apply(np.average, args=
    
df.groupby('x')[['y', 'z']].apply(my_sum)
Loading...
def my_sum(df):
    w = df.z / df.z.sum()
    return df.apply(np.average, args=
    
df.groupby('x')[['y', 'z']].apply(my_sum)
Loading...
import numpy as np
df.apply(np.average, args=(df.z, ))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-cf60e77863f7> in <module>()
      1 import numpy as np
----> 2 df.apply(np.average, args=(df.z,))

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4875                         f, axis,
   4876                         reduce=reduce,
-> 4877                         ignore_failures=ignore_failures)
   4878             else:
   4879                 return self._apply_broadcast(f, axis)

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4971             try:
   4972                 for i, v in enumerate(series_gen):
-> 4973                     results[i] = func(v)
   4974                     keys.append(v.name)
   4975             except Exception as e:

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in f(x)
   4852         if kwds or args and not isinstance(func, np.ufunc):
   4853             def f(x):
-> 4854                 return func(x, *args, **kwds)
   4855         else:
   4856             f = func

/opt/conda/lib/python3.6/site-packages/numpy/lib/function_base.py in average(a, axis, weights, returned)
   1126 
   1127     if weights is None:
-> 1128         avg = a.mean(axis)
   1129         scl = avg.dtype.type(a.size/avg.size)
   1130     else:

/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
     55 
     56     is_float16_result = False
---> 57     rcount = _count_reduce_items(arr, axis)
     58     # Make this warning show up first
     59     if rcount == 0:

/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _count_reduce_items(arr, axis)
     48     items = 1
     49     for ax in axis:
---> 50         items *= arr.shape[ax]
     51     return items
     52 

TypeError: ('tuple indices must be integers or slices, not Series', 'occurred at index x')
?df.apply
Signature: df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds) Docstring: Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty. Parameters ---------- func : function Function to apply to each column/row axis : {0 or 'index', 1 or 'columns'}, default 0 * 0 or 'index': apply function to each column * 1 or 'columns': apply function to each row broadcast : boolean, default False For aggregation functions, return object of same size with values propagated raw : boolean, default False If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance reduce : boolean or None, default None Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned. args : tuple Positional arguments to pass to function in addition to the array/series Additional keyword arguments will be passed as keywords to the function Notes ----- In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row. Examples -------- >>> df.apply(numpy.sqrt) # returns DataFrame >>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0) >>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1) See also -------- DataFrame.applymap: For elementwise operations DataFrame.aggregate: only perform aggregating type operations DataFrame.transform: only perform transformating type operations Returns ------- applied : Series or DataFrame File: /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py Type: method

Comment

By default the group keys are sorted during the groupby operation. You may however pass sort=False to keep keys in the order that they first appear. This will also potential speedup the code.

?pd.DataFrame.apply
Signature: pd.DataFrame.apply(self, func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds) Docstring: Applies function along input axis of DataFrame. Objects passed to functions are Series objects having index either the DataFrame's index (axis=0) or the columns (axis=1). Return type depends on whether passed function aggregates, or the reduce argument if the DataFrame is empty. Parameters ---------- func : function Function to apply to each column/row axis : {0 or 'index', 1 or 'columns'}, default 0 * 0 or 'index': apply function to each column * 1 or 'columns': apply function to each row broadcast : boolean, default False For aggregation functions, return object of same size with values propagated raw : boolean, default False If False, convert each row or column into a Series. If raw=True the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance reduce : boolean or None, default None Try to apply reduction procedures. If the DataFrame is empty, apply will use reduce to determine whether the result should be a Series or a DataFrame. If reduce is None (the default), apply's return value will be guessed by calling func an empty Series (note: while guessing, exceptions raised by func will be ignored). If reduce is True a Series will always be returned, and if False a DataFrame will always be returned. args : tuple Positional arguments to pass to function in addition to the array/series Additional keyword arguments will be passed as keywords to the function Notes ----- In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first column/row. Examples -------- >>> df.apply(numpy.sqrt) # returns DataFrame >>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0) >>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1) See also -------- DataFrame.applymap: For elementwise operations DataFrame.aggregate: only perform aggregating type operations DataFrame.transform: only perform transformating type operations Returns ------- applied : Series or DataFrame File: /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py Type: function