Pandas:当组变量长度不同时,groupby().apply() 自定义函数?



我有一个超过 2M 行的大型数据集,其结构如下:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
'value': np.random.randint(low=0, high=100, size=11)
})
name     month variable  value
0   Alex       May     Cash     44
1    Joe       May     Cash     47
2   Alex       May     Debt     64
3    Joe       May     Debt     67
4   Alex       May   Others     67
5    Joe       May   Others      9
6   Alex     April     Cash     83
7    Joe     April     Cash     21
8   Alex     April     Debt     36
9    Joe     April     Debt     87
10  Alex  February     Cash     70

如果我想计算每个人每个月的净债务,我会这样做:

df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)
name  month
Alex  April     6    NaN
8    NaN
February  10   NaN
May       0    NaN
2    NaN
Joe   April     7    NaN
9    NaN
May       1    NaN
3    NaN
Name: value, dtype: float64 

然而,结果充满了NA值,我相信这是数据帧对每个人和每个月没有相同数量的现金和债务变量的结果。有没有办法让我避免这种情况,并在可能的情况下简单地获得每个月/人的净债务,并在不是的情况下获得 NA?

另外,我对python有点陌生,正如我提到的,我正在处理的数据集非常大 - 所以如果有人知道一种更快/替代的方法,将不胜感激!

IIUC,使用loc

print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum() 
- d.loc[d["variable"].eq("Cash"),"value"].sum()))
name  month   
Alex  April       40
February   -17
May        -25
Joe   April        5
May        -38
dtype: int64

您可以使用df.pivot_table然后在此处使用df.sub

df_p = df.pivot_table(index=['name','month'], columns='variable',values='value',aggfunc='sum')
df_p['Debt'].sub(df_p['Cash'],fill_value=0)
name  month
Alex  April      -47.0
February   -70.0
May         20.0
Joe   April       66.0
May         20.0
dtype: float64

df_p = df.set_index(['name','month','variable'])['value'].unstack()
df_p['Debt'].sub(df_p['Cash'],fill_value=0)

时间结果

In [41]: %%timeit
...: df_p = df.set_index(['name','month','variable'])['value'].unstack()
...: df_p['Debt'].sub(df_p['Cash'],fill_value=0)
...:
...:
3.13 ms ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [42]: %%timeit
...: df_p = df.pivot_table(index=['name','month'], columns='variable',values='value')
...: df_p['Debt'].sub(df_p['Cash'],fill_value=0)
...:
...:
6.9 ms ± 99.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %%timeit
...: (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum()
...:                                                      - d.loc[d["variable"].eq("Cash"),"value"].sum()))
...:
6.64 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

最新更新