Ben Chuanlong Du's Blog

It is never too late to learn.

Aggregation in pandas DataFrame

Comment

  1. The order of elements within each group are preserved (as the original order).

  2. groupby works exactly the same on index if the index is named.

  3. The order of columns in groupby matters if you want unstack the results later.

  4. groupby works on columns too and it can group by some level of a MultiIndex.

groupby on a Column of An Empty Data Frame

In [1]:
import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})

df
Out[1]:
x y z
In [2]:
df.groupby('x')[['y', 'z']].sum()
Out[2]:
y z
x

groupby on the Index of An Empty Data Frame

In [3]:
import pandas as pd

df = pd.DataFrame({'x': [], 'y': [], 'z': []})
df.set_index('x')
df
Out[3]:
x y z
In [4]:
df.groupby('x')[['y', 'z']].sum()
Out[4]:
y z
x

groupby on Non-empty Data Frames

In [5]:
import pandas as pd

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df
Out[5]:
x y z
0 3 1 6
1 3 2 5
2 1 3 4
3 10 4 3
4 1 5 2
5 10 6 1
In [6]:
df.groupby('x')[['y', 'z']].sum()
Out[6]:
y z
x
1 8 6
3 3 11
10 10 4
In [7]:
df.groupby('x').sum()
Out[7]:
y z
x
1 8 6
3 3 11
10 10 4
In [8]:
df.groupby('x')[['y', 'z']].sum()
Out[8]:
y z
x
1 8 6
3 3 11
10 10 4
In [9]:
df.groupby(['x'], sort=False).sum()
Out[9]:
y z
x
3 3 11
1 8 6
10 10 4

Aggregation Function Taking Extra Parameters

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)

df
Out[2]:
x y z
0 3 1 6
1 3 2 5
2 1 3 4
3 10 4 3
4 1 5 2
5 10 6 1
In [3]:
def my_min(x, offset=0):
    return 0 + min(x)
In [4]:
df.groupby('x')[['y', 'z']].agg(my_min, offset=1000)
Out[4]:
y z
x
1 3 2
3 1 5
10 4 1
In [5]:
df.groupby('x')[['y', 'z']].agg(min)
Out[5]:
y z
x
1 3 2
3 1 5
10 4 1
In [4]:
df.groupby(['x'], sort=False).apply(lambda x: x)
Out[4]:
x y z
0 3 1 6
1 3 2 5
2 1 3 4
3 10 4 3
4 1 5 2
5 10 6 1

agg

Notice that most aggregation functions just ignore NaN!!!

min on each column inside each group.

In [5]:
df.groupby('x').agg('min')
Out[5]:
y z
x
1 3 2
3 1 5
10 4 1

Multiple aggregations for each column.

In [6]:
df.groupby('x').agg(['min', 'max'])
Out[6]:
y z
min max min max
x
1 3 5 2 4
3 1 2 5 6
10 4 6 1 3

Aggregate on the column y only.

In [7]:
df.groupby('x').y.agg(['min', 'max'])
Out[7]:
min max
x
1 3 5
3 1 2
10 4 6

Group by Multiple Criterias

When grouping by multiple criterias, you can mix labels and series together.

In [6]:
df.groupby(['x', 'y']).sum()
Out[6]:
z
x y
1 3 4
5 2
3 1 6
2 5
10 4 3
6 1
In [7]:
df.groupby(['x', df.y]).sum()
Out[7]:
z
x y
1 3 4
5 2
3 1 6
2 5
10 4 3
6 1
In [8]:
df.groupby([df.x, df.y]).sum()
Out[8]:
z
x y
1 3 4
5 2
3 1 6
2 5
10 4 3
6 1

Naming Aggreated Columns

In [2]:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df
Out[2]:
x y z
0 3 1 6
1 3 2 5
2 1 3 4
3 10 4 3
4 1 5 2
5 10 6 1
In [3]:
df.groupby('x').agg(
    y_avg=('y', np.average),
    y_sum=('y', sum),
    x_sum=('x', sum),
)
Out[3]:
y_avg y_sum x_sum
x
1 4.0 8 2
3 1.5 3 6
10 5.0 10 20

You CANNOT use multiple lambda functions in the aggregate method as of pandas 0.25.3. A patch has been made but not released yet. Before the fix is released, you just need to define lambda functions as regular named functions to avoid the issue.

In [4]:
df.groupby('x').agg(
    y_avg=('y', lambda x: np.average(x)),
    y_sum=('y', lambda x: sum(x)),
    x_sum=('x', sum),
)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-859b6ed6abd5> in <module>
      2     y_avg=('y', lambda x: np.average(x)),
      3     y_sum=('y', lambda x: sum(x)),
----> 4     x_sum=('x', sum),
      5 )

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
   1453     @Appender(_shared_docs["aggregate"])
   1454     def aggregate(self, arg=None, *args, **kwargs):
-> 1455         return super().aggregate(arg, *args, **kwargs)
   1456 
   1457     agg = aggregate

/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    262 
    263         if relabeling:
--> 264             result = result[order]
    265             result.columns = columns
    266 

/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2999             if is_iterator(key):
   3000                 key = list(key)
-> 3001             indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
   3002 
   3003         # take() does not accept boolean indexers

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
   1283                 # When setting, missing keys are not allowed, even with .loc:
   1284                 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1285                 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
   1286         else:
   1287             try:

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1090 
   1091         self._validate_read_indexer(
-> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1093         )
   1094         return keyarr, indexer

/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1183             if not (self.name == "loc" and not raise_missing):
   1184                 not_found = list(set(key) - set(ax))
-> 1185                 raise KeyError("{} not in index".format(not_found))
   1186 
   1187             # we skip the warning on Categorical/Interval

KeyError: "[('y', '<lambda>')] not in index"

By default, the groupby column is used as the index.

In [8]:
r = df.groupby('x').agg({'y': 'max', 'z': ['max', 'min', 'mean', 'count']})
r
Out[8]:
y z
max max min mean count
x
1 5 4 2 3.0 2
3 2 6 5 5.5 2
10 6 3 1 2.0 2

You can have the groupby column as an column in final results using the option as_index=False.

In [9]:
r = df.groupby('x',
               as_index=False).agg({
                   'y': 'max',
                   'z': ['max', 'min', 'mean', 'count']
               })
r
Out[9]:
x y z
max max min mean count
0 1 5 4 2 3.0 2
1 3 2 6 5 5.5 2
2 10 6 3 1 2.0 2
In [10]:
r.columns
Out[10]:
MultiIndex(levels=[['y', 'z', 'x'], ['count', 'max', 'mean', 'min', '']],
           labels=[[2, 0, 1, 1, 1, 1], [4, 1, 1, 3, 2, 0]])
In [11]:
r.columns = ['x', 'ymax', 'zmax', 'zmin', 'zmean', 'zcnt']
r
Out[11]:
x ymax zmax zmin zmean zcnt
0 1 5 4 2 3.0 2
1 3 2 6 5 5.5 2
2 10 6 3 1 2.0 2

Equivalent of Having

df.groupby('col').filter

In [38]:
2**0.5
Out[38]:
1.4142135623730951
In [39]:
pow(2, 0.5)
Out[39]:
1.4142135623730951
In [45]:
s = pd.Series([1, 2, 3])
s
Out[45]:
0    1
1    2
2    3
dtype: int64
In [46]:
s['abc'] = 1000
In [47]:
s
Out[47]:
0         1
1         2
2         3
abc    1000
dtype: int64
In [ ]:
 

Aggregation Using apply

In [30]:
df.apply(np.average, args=(None, df.z))
Out[30]:
x    3.761905
y    2.666667
z    4.333333
dtype: float64
In [36]:
df.drop('z', axis=1)
Out[36]:
x y
0 3 1
1 3 2
2 1 3
3 10 4
4 1 5
5 10 6
In [33]:
df.apply(lambda col: np.average(col, weights=df.z))
Out[33]:
x    3.761905
y    2.666667
z    4.333333
dtype: float64
In [31]:
np.average(df.x, weights=df.z)
Out[31]:
3.761904761904762
In [32]:
np.average(df.y, weights=df.z)
Out[32]:
2.6666666666666665
In [27]:
def my_sum(df):
    w = df.z / df.z.sum()
    return df.apply(np.average, args=
    
df.groupby('x')[['y', 'z']].apply(my_sum)
Out[27]:
y z
x
1 8 6
3 3 11
10 10 4
In [27]:
def my_sum(df):
    w = df.z / df.z.sum()
    return df.apply(np.average, args=
    
df.groupby('x')[['y', 'z']].apply(my_sum)
Out[27]:
y z
x
1 8 6
3 3 11
10 10 4
In [20]:
import numpy as np
df.apply(np.average, args=(df.z, ))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-cf60e77863f7> in <module>()
      1 import numpy as np
----> 2 df.apply(np.average, args=(df.z,))

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4875                         f, axis,
   4876                         reduce=reduce,
-> 4877                         ignore_failures=ignore_failures)
   4878             else:
   4879                 return self._apply_broadcast(f, axis)

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4971             try:
   4972                 for i, v in enumerate(series_gen):
-> 4973                     results[i] = func(v)
   4974                     keys.append(v.name)
   4975             except Exception as e:

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in f(x)
   4852         if kwds or args and not isinstance(func, np.ufunc):
   4853             def f(x):
-> 4854                 return func(x, *args, **kwds)
   4855         else:
   4856             f = func

/opt/conda/lib/python3.6/site-packages/numpy/lib/function_base.py in average(a, axis, weights, returned)
   1126 
   1127     if weights is None:
-> 1128         avg = a.mean(axis)
   1129         scl = avg.dtype.type(a.size/avg.size)
   1130     else:

/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
     55 
     56     is_float16_result = False
---> 57     rcount = _count_reduce_items(arr, axis)
     58     # Make this warning show up first
     59     if rcount == 0:

/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _count_reduce_items(arr, axis)
     48     items = 1
     49     for ax in axis:
---> 50         items *= arr.shape[ax]
     51     return items
     52 

TypeError: ('tuple indices must be integers or slices, not Series', 'occurred at index x')
In [18]:
?df.apply
Signature: df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Docstring:
Applies function along input axis of DataFrame.

Objects passed to functions are Series objects having index
either the DataFrame's index (axis=0) or the columns (axis=1).
Return type depends on whether passed function aggregates, or the
reduce argument if the DataFrame is empty.

Parameters
----------
func : function
    Function to apply to each column/row
axis : {0 or 'index', 1 or 'columns'}, default 0
    * 0 or 'index': apply function to each column
    * 1 or 'columns': apply function to each row
broadcast : boolean, default False
    For aggregation functions, return object of same size with values
    propagated
raw : boolean, default False
    If False, convert each row or column into a Series. If raw=True the
    passed function will receive ndarray objects instead. If you are
    just applying a NumPy reduction function this will achieve much
    better performance
reduce : boolean or None, default None
    Try to apply reduction procedures. If the DataFrame is empty,
    apply will use reduce to determine whether the result should be a
    Series or a DataFrame. If reduce is None (the default), apply's
    return value will be guessed by calling func an empty Series (note:
    while guessing, exceptions raised by func will be ignored). If
    reduce is True a Series will always be returned, and if False a
    DataFrame will always be returned.
args : tuple
    Positional arguments to pass to function in addition to the
    array/series
Additional keyword arguments will be passed as keywords to the function

Notes
-----
In the current implementation apply calls func twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if func has
side-effects, as they will take effect twice for the first
column/row.

Examples
--------
>>> df.apply(numpy.sqrt) # returns DataFrame
>>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0)
>>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1)

See also
--------
DataFrame.applymap: For elementwise operations
DataFrame.aggregate: only perform aggregating type operations
DataFrame.transform: only perform transformating type operations

Returns
-------
applied : Series or DataFrame
File:      /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py
Type:      method

Comment

By default the group keys are sorted during the groupby operation. You may however pass sort=False to keep keys in the order that they first appear. This will also potential speedup the code.

In [11]:
?pd.DataFrame.apply
Signature: pd.DataFrame.apply(self, func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Docstring:
Applies function along input axis of DataFrame.

Objects passed to functions are Series objects having index
either the DataFrame's index (axis=0) or the columns (axis=1).
Return type depends on whether passed function aggregates, or the
reduce argument if the DataFrame is empty.

Parameters
----------
func : function
    Function to apply to each column/row
axis : {0 or 'index', 1 or 'columns'}, default 0
    * 0 or 'index': apply function to each column
    * 1 or 'columns': apply function to each row
broadcast : boolean, default False
    For aggregation functions, return object of same size with values
    propagated
raw : boolean, default False
    If False, convert each row or column into a Series. If raw=True the
    passed function will receive ndarray objects instead. If you are
    just applying a NumPy reduction function this will achieve much
    better performance
reduce : boolean or None, default None
    Try to apply reduction procedures. If the DataFrame is empty,
    apply will use reduce to determine whether the result should be a
    Series or a DataFrame. If reduce is None (the default), apply's
    return value will be guessed by calling func an empty Series (note:
    while guessing, exceptions raised by func will be ignored). If
    reduce is True a Series will always be returned, and if False a
    DataFrame will always be returned.
args : tuple
    Positional arguments to pass to function in addition to the
    array/series
Additional keyword arguments will be passed as keywords to the function

Notes
-----
In the current implementation apply calls func twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if func has
side-effects, as they will take effect twice for the first
column/row.

Examples
--------
>>> df.apply(numpy.sqrt) # returns DataFrame
>>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0)
>>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1)

See also
--------
DataFrame.applymap: For elementwise operations
DataFrame.aggregate: only perform aggregating type operations
DataFrame.transform: only perform transformating type operations

Returns
-------
applied : Series or DataFrame
File:      /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py
Type:      function

Comments