Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Comment¶
The order of elements within each group are preserved (as the original order).
groupbyworks exactly the same on index if the index is named.The order of columns in groupby matters if you want unstack the results later.
groupby works on columns too and it can group by some level of a MultiIndex.
groupby on a Column of An Empty Data Frame¶
import pandas as pd
df = pd.DataFrame({'x': [], 'y': [], 'z': []})
dfdf.groupby('x')[['y', 'z']].sum()groupby on the Index of An Empty Data Frame¶
import pandas as pd
df = pd.DataFrame({'x': [], 'y': [], 'z': []})
df.set_index('x')
dfdf.groupby('x')[['y', 'z']].sum()groupby on Non-empty Data Frames¶
import pandas as pd
df = pd.DataFrame(
{
'x': [3, 3, 1, 10, 1, 10],
'y': [1, 2, 3, 4, 5, 6],
'z': [6, 5, 4, 3, 2, 1]
}
)
dfdf.groupby('x')[['y', 'z']].sum()df.groupby('x').sum()df.groupby('x')[['y', 'z']].sum()df.groupby(['x'], sort=False).sum()Aggregation Function Taking Extra Parameters¶
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
'x': [3, 3, 1, 10, 1, 10],
'y': [1, 2, 3, 4, 5, 6],
'z': [6, 5, 4, 3, 2, 1]
}
)
dfdef my_min(x, offset=0):
return 0 + min(x)df.groupby('x')[['y', 'z']].agg(my_min, offset=1000)df.groupby('x')[['y', 'z']].agg(min)df.groupby(['x'], sort=False).apply(lambda x: x)agg¶
Notice that most aggregation functions just ignore NaN!!!
min on each column inside each group.
df.groupby('x').agg('min')Multiple aggregations for each column.
df.groupby('x').agg(['min', 'max'])Aggregate on the column y only.
df.groupby('x').y.agg(['min', 'max'])Group by Multiple Criterias¶
When grouping by multiple criterias, you can mix labels and series together.
df.groupby(['x', 'y']).sum()df.groupby(['x', df.y]).sum()df.groupby([df.x, df.y]).sum()Naming Aggreated Columns¶
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
'x': [3, 3, 1, 10, 1, 10],
'y': [1, 2, 3, 4, 5, 6],
'z': [6, 5, 4, 3, 2, 1]
}
)
dfdf.groupby('x').agg(
y_avg=('y', np.average),
y_sum=('y', sum),
x_sum=('x', sum),
)You CANNOT use multiple lambda functions in the aggregate method as of pandas 0.25.3.
A patch has been made but not released yet.
Before the fix is released,
you just need to define lambda functions as regular named functions to avoid the issue.
df.groupby('x').agg(
y_avg=('y', lambda x: np.average(x)),
y_sum=('y', lambda x: sum(x)),
x_sum=('x', sum),
)---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-859b6ed6abd5> in <module>
2 y_avg=('y', lambda x: np.average(x)),
3 y_sum=('y', lambda x: sum(x)),
----> 4 x_sum=('x', sum),
5 )
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1453 @Appender(_shared_docs["aggregate"])
1454 def aggregate(self, arg=None, *args, **kwargs):
-> 1455 return super().aggregate(arg, *args, **kwargs)
1456
1457 agg = aggregate
/usr/local/lib/python3.7/dist-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
262
263 if relabeling:
--> 264 result = result[order]
265 result.columns = columns
266
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2999 if is_iterator(key):
3000 key = list(key)
-> 3001 indexer = self.loc._convert_to_indexer(key, axis=1, raise_missing=True)
3002
3003 # take() does not accept boolean indexers
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter, raise_missing)
1283 # When setting, missing keys are not allowed, even with .loc:
1284 kwargs = {"raise_missing": True if is_setter else raise_missing}
-> 1285 return self._get_listlike_indexer(obj, axis, **kwargs)[1]
1286 else:
1287 try:
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1090
1091 self._validate_read_indexer(
-> 1092 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1093 )
1094 return keyarr, indexer
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1183 if not (self.name == "loc" and not raise_missing):
1184 not_found = list(set(key) - set(ax))
-> 1185 raise KeyError("{} not in index".format(not_found))
1186
1187 # we skip the warning on Categorical/Interval
KeyError: "[('y', '<lambda>')] not in index"By default, the groupby column is used as the index.
r = df.groupby('x').agg({'y': 'max', 'z': ['max', 'min', 'mean', 'count']})
rYou can have the groupby column as an column in final results
using the option as_index=False.
r = df.groupby('x',
as_index=False).agg({
'y': 'max',
'z': ['max', 'min', 'mean', 'count']
})
rr.columnsMultiIndex(levels=[['y', 'z', 'x'], ['count', 'max', 'mean', 'min', '']],
labels=[[2, 0, 1, 1, 1, 1], [4, 1, 1, 3, 2, 0]])r.columns = ['x', 'ymax', 'zmax', 'zmin', 'zmean', 'zcnt']
rEquivalent of Having¶
df.groupby(‘col’).filter
2**0.51.4142135623730951pow(2, 0.5)1.4142135623730951s = pd.Series([1, 2, 3])
s0 1
1 2
2 3
dtype: int64s['abc'] = 1000s0 1
1 2
2 3
abc 1000
dtype: int64Aggregation Using apply¶
df.apply(np.average, args=(None, df.z))x 3.761905
y 2.666667
z 4.333333
dtype: float64df.drop('z', axis=1)df.apply(lambda col: np.average(col, weights=df.z))x 3.761905
y 2.666667
z 4.333333
dtype: float64np.average(df.x, weights=df.z)3.761904761904762np.average(df.y, weights=df.z)2.6666666666666665def my_sum(df):
w = df.z / df.z.sum()
return df.apply(np.average, args=
df.groupby('x')[['y', 'z']].apply(my_sum)def my_sum(df):
w = df.z / df.z.sum()
return df.apply(np.average, args=
df.groupby('x')[['y', 'z']].apply(my_sum)import numpy as np
df.apply(np.average, args=(df.z, ))---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-cf60e77863f7> in <module>()
1 import numpy as np
----> 2 df.apply(np.average, args=(df.z,))
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4875 f, axis,
4876 reduce=reduce,
-> 4877 ignore_failures=ignore_failures)
4878 else:
4879 return self._apply_broadcast(f, axis)
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4971 try:
4972 for i, v in enumerate(series_gen):
-> 4973 results[i] = func(v)
4974 keys.append(v.name)
4975 except Exception as e:
/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in f(x)
4852 if kwds or args and not isinstance(func, np.ufunc):
4853 def f(x):
-> 4854 return func(x, *args, **kwds)
4855 else:
4856 f = func
/opt/conda/lib/python3.6/site-packages/numpy/lib/function_base.py in average(a, axis, weights, returned)
1126
1127 if weights is None:
-> 1128 avg = a.mean(axis)
1129 scl = avg.dtype.type(a.size/avg.size)
1130 else:
/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
55
56 is_float16_result = False
---> 57 rcount = _count_reduce_items(arr, axis)
58 # Make this warning show up first
59 if rcount == 0:
/opt/conda/lib/python3.6/site-packages/numpy/core/_methods.py in _count_reduce_items(arr, axis)
48 items = 1
49 for ax in axis:
---> 50 items *= arr.shape[ax]
51 return items
52
TypeError: ('tuple indices must be integers or slices, not Series', 'occurred at index x')?df.applySignature: df.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Docstring:
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index
either the DataFrame's index (axis=0) or the columns (axis=1).
Return type depends on whether passed function aggregates, or the
reduce argument if the DataFrame is empty.
Parameters
----------
func : function
Function to apply to each column/row
axis : {0 or 'index', 1 or 'columns'}, default 0
* 0 or 'index': apply function to each column
* 1 or 'columns': apply function to each row
broadcast : boolean, default False
For aggregation functions, return object of same size with values
propagated
raw : boolean, default False
If False, convert each row or column into a Series. If raw=True the
passed function will receive ndarray objects instead. If you are
just applying a NumPy reduction function this will achieve much
better performance
reduce : boolean or None, default None
Try to apply reduction procedures. If the DataFrame is empty,
apply will use reduce to determine whether the result should be a
Series or a DataFrame. If reduce is None (the default), apply's
return value will be guessed by calling func an empty Series (note:
while guessing, exceptions raised by func will be ignored). If
reduce is True a Series will always be returned, and if False a
DataFrame will always be returned.
args : tuple
Positional arguments to pass to function in addition to the
array/series
Additional keyword arguments will be passed as keywords to the function
Notes
-----
In the current implementation apply calls func twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if func has
side-effects, as they will take effect twice for the first
column/row.
Examples
--------
>>> df.apply(numpy.sqrt) # returns DataFrame
>>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0)
>>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1)
See also
--------
DataFrame.applymap: For elementwise operations
DataFrame.aggregate: only perform aggregating type operations
DataFrame.transform: only perform transformating type operations
Returns
-------
applied : Series or DataFrame
File: /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py
Type: method
Comment¶
By default the group keys are sorted during the groupby operation. You may however pass sort=False to keep keys in the order that they first appear. This will also potential speedup the code.
?pd.DataFrame.applySignature: pd.DataFrame.apply(self, func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)
Docstring:
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index
either the DataFrame's index (axis=0) or the columns (axis=1).
Return type depends on whether passed function aggregates, or the
reduce argument if the DataFrame is empty.
Parameters
----------
func : function
Function to apply to each column/row
axis : {0 or 'index', 1 or 'columns'}, default 0
* 0 or 'index': apply function to each column
* 1 or 'columns': apply function to each row
broadcast : boolean, default False
For aggregation functions, return object of same size with values
propagated
raw : boolean, default False
If False, convert each row or column into a Series. If raw=True the
passed function will receive ndarray objects instead. If you are
just applying a NumPy reduction function this will achieve much
better performance
reduce : boolean or None, default None
Try to apply reduction procedures. If the DataFrame is empty,
apply will use reduce to determine whether the result should be a
Series or a DataFrame. If reduce is None (the default), apply's
return value will be guessed by calling func an empty Series (note:
while guessing, exceptions raised by func will be ignored). If
reduce is True a Series will always be returned, and if False a
DataFrame will always be returned.
args : tuple
Positional arguments to pass to function in addition to the
array/series
Additional keyword arguments will be passed as keywords to the function
Notes
-----
In the current implementation apply calls func twice on the
first column/row to decide whether it can take a fast or slow
code path. This can lead to unexpected behavior if func has
side-effects, as they will take effect twice for the first
column/row.
Examples
--------
>>> df.apply(numpy.sqrt) # returns DataFrame
>>> df.apply(numpy.sum, axis=0) # equiv to df.sum(0)
>>> df.apply(numpy.sum, axis=1) # equiv to df.sum(1)
See also
--------
DataFrame.applymap: For elementwise operations
DataFrame.aggregate: only perform aggregating type operations
DataFrame.transform: only perform transformating type operations
Returns
-------
applied : Series or DataFrame
File: /opt/conda/lib/python3.6/site-packages/pandas/core/frame.py
Type: function