我有一个超过 2M 行的大型数据集,其结构如下:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
'value': np.random.randint(low=0, high=100, size=11)
})
name month variable value
0 Alex May Cash 44
1 Joe May Cash 47
2 Alex May Debt 64
3 Joe May Debt 67
4 Alex May Others 67
5 Joe May Others 9
6 Alex April Cash 83
7 Joe April Cash 21
8 Alex April Debt 36
9 Joe April Debt 87
10 Alex February Cash 70
如果我想计算每个人每个月的净债务,我会这样做:
df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)
name month
Alex April 6 NaN
8 NaN
February 10 NaN
May 0 NaN
2 NaN
Joe April 7 NaN
9 NaN
May 1 NaN
3 NaN
Name: value, dtype: float64
然而,结果充满了NA值,我相信这是数据帧对每个人和每个月没有相同数量的现金和债务变量的结果。有没有办法让我避免这种情况,并在可能的情况下简单地获得每个月/人的净债务,并在不是的情况下获得 NA?
另外,我对python有点陌生,正如我提到的,我正在处理的数据集非常大 - 所以如果有人知道一种更快/替代的方法,将不胜感激!
IIUC,使用loc
:
print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum()
- d.loc[d["variable"].eq("Cash"),"value"].sum()))
name month
Alex April 40
February -17
May -25
Joe April 5
May -38
dtype: int64
您可以使用df.pivot_table
然后在此处使用df.sub
。
df_p = df.pivot_table(index=['name','month'], columns='variable',values='value',aggfunc='sum')
df_p['Debt'].sub(df_p['Cash'],fill_value=0)
name month
Alex April -47.0
February -70.0
May 20.0
Joe April 66.0
May 20.0
dtype: float64
或
df_p = df.set_index(['name','month','variable'])['value'].unstack()
df_p['Debt'].sub(df_p['Cash'],fill_value=0)
时间结果
In [41]: %%timeit
...: df_p = df.set_index(['name','month','variable'])['value'].unstack()
...: df_p['Debt'].sub(df_p['Cash'],fill_value=0)
...:
...:
3.13 ms ± 8.91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [42]: %%timeit
...: df_p = df.pivot_table(index=['name','month'], columns='variable',values='value')
...: df_p['Debt'].sub(df_p['Cash'],fill_value=0)
...:
...:
6.9 ms ± 99.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [43]: %%timeit
...: (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum()
...: - d.loc[d["variable"].eq("Cash"),"value"].sum()))
...:
6.64 ms ± 421 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)