我有大约 3000 万行的数据帧:
ID DATE STATUS
123 2017-01-04 18:08:56+00:00 True
2017-01-04 18:09:56+00:00 True
2017-01-06 19:12:30+00:00 False
2017-01-07 19:12:30+00:00 False
2017-01-08 19:12:30+00:00 False
401 2017-01-01 18:08:56+00:00 False
2017-01-02 18:19:56+00:00 True
2017-01-05 09:15:30+00:00 True
2017-01-06 12:12:30+00:00 False
我想逐行计算每个 ID 的 True 值的百分比。该值应根据当前行的状态增加或减少。
我不擅长描述,但结果应该类似于这样:
ID DATE STATUS VALUE(%)
123 2017-01-04 18:08:56+00:00 True 100
2017-01-04 18:09:56+00:00 True 100
2017-01-06 19:12:30+00:00 False 66.66
2017-01-06 19:12:30+00:00 False 50
2017-01-06 19:12:30+00:00 False 40
401 2017-01-04 18:08:56+00:00 False 0
2017-01-04 18:09:56+00:00 True 50
2017-01-06 19:12:30+00:00 True 66.66
2017-01-06 19:12:30+00:00 False 50
将GroupBy.cumsum
除以GroupBy.cumcount
,乘以100
,必要时四舍五入:
a = df.groupby(level=0)['STATUS'].cumsum()
b = df.groupby(level=0).cumcount() + 1
df['Val'] = a.div(b).mul(100).round(2)
print (df)
STATUS Val
ID DATE
123 2017-01-04 18:08:56+00:00 True 100.00
2017-01-04 18:09:56+00:00 True 100.00
2017-01-06 19:12:30+00:00 False 66.67
2017-01-07 19:12:30+00:00 False 50.00
2017-01-08 19:12:30+00:00 False 40.00
401 2017-01-01 18:08:56+00:00 False 0.00
2017-01-02 18:19:56+00:00 True 50.00
2017-01-05 09:15:30+00:00 True 66.67
2017-01-06 12:12:30+00:00 False 50.00
你也可以这样做
df['p'] = df.groupby('ID')['STATUS'].transform(lambda x: x.cumsum()/(x.index-x.index.min()+1)).mul(100)
df1.groupby('ID').apply(lambda dd:dd.reset_index(drop=True)
.assign(col2=lambda dd:dd.STATUS.cumsum()/(dd.index+1)*100))
.droplevel([0]).pipe(print)
ID DATE STATUS col2
0 123 2017-01-04 18:08:56+00:00 True 100.000000
1 123 2017-01-04 18:09:56+00:00 True 100.000000
2 123 2017-01-06 19:12:30+00:00 False 66.666667
3 123 2017-01-07 19:12:30+00:00 False 50.000000
4 123 2017-01-08 19:12:30+00:00 False 40.000000
0 401 2017-01-01 18:08:56+00:00 False 0.000000
1 401 2017-01-02 18:19:56+00:00 True 50.000000
2 401 2017-01-05 09:15:30+00:00 True 66.666667
3 401 2017-01-06 12:12:30+00:00 False 50.000000