Comment¶
The order of elements within each group is preserved (as in the original order).
groupby works exactly the same on the index if the index is named. The order of columns in groupby matters if you want to unstack the results later.
groupby works on columns too, and it can group by a level of a MultiIndex.
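The points above can be checked with a small sketch. The frame and the names `k`, `j`, and `v` are made up for illustration and are not part of the examples that follow.

```python
import pandas as pd

# Hypothetical toy data: keys appear in the order b, a, b, a.
toy = pd.DataFrame({"k": ["b", "a", "b", "a"], "v": [1, 2, 3, 4]})

# Within each group, rows keep their original relative order.
order_within_groups = {key: g["v"].tolist() for key, g in toy.groupby("k")}

# Grouping by a named index behaves like grouping by a column.
by_column = toy.groupby("k")["v"].sum()
by_index = toy.set_index("k").groupby("k")["v"].sum()

# Grouping by one level of a MultiIndex.
multi = toy.assign(j=[0, 0, 1, 1]).set_index(["k", "j"])
by_level = multi.groupby(level="k")["v"].sum()

print(order_within_groups)
print(by_column.tolist(), by_index.tolist(), by_level.tolist())
```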
groupby on a Column of an Empty Data Frame¶
import pandas as pd
df = pd.DataFrame({'x': [], 'y': [], 'z': []})
df
df.groupby('x')[['y', 'z']].sum()
groupby on the Index of an Empty Data Frame¶
import pandas as pd
df = pd.DataFrame({'x': [], 'y': [], 'z': []})
# set_index returns a new frame, so assign it back to actually group by the index
df = df.set_index('x')
df
df.groupby('x')[['y', 'z']].sum()
groupby on Non-empty Data Frames¶
import pandas as pd
df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df
df.groupby('x')[['y', 'z']].sum()
df.groupby('x').sum()
df.groupby(['x'], sort=False).sum()
Aggregation Function Taking Extra Parameters¶
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df
def my_min(x, offset=0):
    # offset is forwarded by agg as an extra keyword argument
    return offset + min(x)
df.groupby('x')[['y', 'z']].agg(my_min, offset=1000)
df.groupby('x')[['y', 'z']].agg(min)
df.groupby(['x'], sort=False).apply(lambda x: x)
agg¶
Note that most aggregation functions ignore NaN.
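A minimal sketch of the NaN behaviour, using a made-up frame `df_nan` (not one of the frames in this notebook). It contrasts `count`, which skips NaN, with `size`, which counts every row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 1 has one valid value, group 2 is all NaN.
df_nan = pd.DataFrame({"x": [1, 1, 2, 2], "y": [1.0, np.nan, np.nan, np.nan]})
g = df_nan.groupby("x")["y"]

total = g.sum()      # NaN values are skipped; an all-NaN group sums to 0.0
smallest = g.min()   # an all-NaN group yields NaN
n_valid = g.count()  # number of non-NaN values per group
n_rows = g.size()    # number of rows per group, NaN included

print(pd.DataFrame({"sum": total, "min": smallest, "count": n_valid, "size": n_rows}))
```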
Apply min to each column inside each group.
df.groupby('x').agg('min')
Multiple aggregations for each column.
df.groupby('x').agg(['min', 'max'])
Aggregate on the column y only.
df.groupby('x').y.agg(['min', 'max'])
Group by Multiple Criteria¶
When grouping by multiple criteria, you can mix labels and series together.
df.groupby(['x', 'y']).sum()
df.groupby(['x', df.y]).sum()
df.groupby([df.x, df.y]).sum()
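As a quick check that the three spellings group the same way, the sketch below rebuilds the example frame under the name `df_demo` (a name chosen here for illustration) and compares the aggregated `z` column.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "x": [3, 3, 1, 10, 1, 10],
        "y": [1, 2, 3, 4, 5, 6],
        "z": [6, 5, 4, 3, 2, 1],
    }
)

# Labels only, a label mixed with a Series, and Series only.
by_labels = df_demo.groupby(["x", "y"])["z"].sum()
by_mixed = df_demo.groupby(["x", df_demo["y"]])["z"].sum()
by_series = df_demo.groupby([df_demo["x"], df_demo["y"]])["z"].sum()

print(by_labels)
```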
Naming Aggregated Columns¶
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        'x': [3, 3, 1, 10, 1, 10],
        'y': [1, 2, 3, 4, 5, 6],
        'z': [6, 5, 4, 3, 2, 1]
    }
)
df
df.groupby('x').agg(
    y_avg=('y', np.average),
    y_sum=('y', sum),
    x_sum=('x', sum),
)
You CANNOT use multiple lambda functions in the aggregate method as of pandas 0.25.3.
A patch has been made but had not been released at the time of writing.
Until the fix is released, define the lambda functions as regular named functions to avoid the issue.
df.groupby('x').agg(
    y_avg=('y', lambda x: np.average(x)),
    y_sum=('y', lambda x: sum(x)),
    x_sum=('x', sum),
)
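The workaround mentioned above (named functions instead of lambdas) might look like the sketch below. The helper names `y_average` and `y_total` and the frame name `df_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {
        "x": [3, 3, 1, 10, 1, 10],
        "y": [1, 2, 3, 4, 5, 6],
        "z": [6, 5, 4, 3, 2, 1],
    }
)

# Hypothetical named helpers standing in for the two lambdas.
def y_average(s):
    return np.average(s)

def y_total(s):
    return s.sum()

named = df_demo.groupby("x").agg(
    y_avg=("y", y_average),
    y_sum=("y", y_total),
)
print(named)
```

Because each function has a distinct `__name__`, the two aggregations on the same column do not collide.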
By default, the groupby column is used as the index.
r = df.groupby('x').agg({'y': 'max', 'z': ['max', 'min', 'mean', 'count']})
r
You can keep the groupby column as a column in the final results using the option as_index=False.
r = df.groupby('x', as_index=False).agg({
    'y': 'max',
    'z': ['max', 'min', 'mean', 'count']
})
r
r.columns
r.columns = ['x', 'ymax', 'zmax', 'zmin', 'zmean', 'zcnt']
r
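Instead of typing the flat column names by hand, the two-level column labels can be joined programmatically. This is only a sketch: the frame name `df_demo` and the underscore-joined naming scheme are choices made here, not something the steps above prescribe.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "x": [3, 3, 1, 10, 1, 10],
        "y": [1, 2, 3, 4, 5, 6],
        "z": [6, 5, 4, 3, 2, 1],
    }
)

flat = df_demo.groupby("x", as_index=False).agg(
    {"y": "max", "z": ["max", "min", "mean", "count"]}
)

# Join the non-empty parts of each column label; plain string labels pass through.
flat.columns = [
    "_".join(str(part) for part in col if part) if isinstance(col, tuple) else col
    for col in flat.columns
]
print(flat)
```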
Equivalent of Having¶
df.groupby('col').filter
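A minimal sketch of `filter` used like SQL's HAVING: it keeps the rows of every group for which the predicate returns True. The frame name `df_demo` and the threshold of 5 are arbitrary choices for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "x": [3, 3, 1, 10, 1, 10],
        "y": [1, 2, 3, 4, 5, 6],
        "z": [6, 5, 4, 3, 2, 1],
    }
)

# Roughly: SELECT * FROM df_demo WHERE x IN
#          (SELECT x FROM df_demo GROUP BY x HAVING SUM(y) > 5)
kept = df_demo.groupby("x").filter(lambda g: g["y"].sum() > 5)
print(kept)
```

Unlike an aggregation, `filter` returns the original rows (in their original order), not one row per group.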
2**0.5
pow(2, 0.5)
s = pd.Series([1, 2, 3])
s
s['abc'] = 1000
s
Aggregation Using apply¶
df.apply(np.average, args=(None, df.z))
df.drop('z', axis=1)
df.apply(lambda col: np.average(col, weights=df.z))
np.average(df.x, weights=df.z)
np.average(df.y, weights=df.z)
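def my_sum(df):
    # Normalized weights taken from column z within the group
    w = df.z / df.z.sum()
    # Weighted average of each column: axis=None, weights=w
    return df.apply(np.average, args=(None, w))
df.groupby('x')[['y', 'z']].apply(my_sum)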
import numpy as np
# Pass the weights by keyword; a single positional extra argument would be taken as axis
df.apply(np.average, weights=df.z)
?df.apply
Comment¶
By default, the group keys are sorted during the groupby operation. You can pass sort=False to keep the keys in the order in which they first appear. This can also speed up the code.
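A short sketch of the difference in key order, using the example frame rebuilt under the illustrative name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {
        "x": [3, 3, 1, 10, 1, 10],
        "y": [1, 2, 3, 4, 5, 6],
        "z": [6, 5, 4, 3, 2, 1],
    }
)

# Default: group keys come back sorted.
sorted_keys = df_demo.groupby("x")["y"].sum().index.tolist()

# sort=False: group keys come back in order of first appearance.
first_seen_keys = df_demo.groupby("x", sort=False)["y"].sum().index.tolist()

print(sorted_keys, first_seen_keys)
```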
?pd.DataFrame.apply