Sum multiple columns of a MultiIndex DataFrame and add the result as a new column



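As the title says, I want to group a MultiIndex df, check whether certain conditions are met, and then compute a total score in another column. After that I want to rank the rows by this score. My own solution would be rather complicated, because I would have to rebuild a compatible df (e.g. for df.update()). My problem is that pandas.core.groupby.GroupBy.sum has no axis keyword like pandas.DataFrame.sum does.

Minimal example: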

import pandas as pd
from pandas import Timestamp
dd={'aavg': {
(Timestamp('2021-02-03 00:00:00'), 277, 403): 0.0694,
(Timestamp('2021-02-03 00:00:00'), 278, 403): 0.09393,
(Timestamp('2021-02-03 00:00:00'), 277, 415): 0.148286,
(Timestamp('2021-02-03 00:00:00'), 278, 415): 0.118842,
(Timestamp('2021-02-04 00:00:00'), 278, 403): 0.078156,
(Timestamp('2021-02-04 00:00:00'), 277, 415): 0.11675,
(Timestamp('2021-02-04 00:00:00'), 278, 415): 0.126452,
(Timestamp('2021-02-05 00:00:00'), 277, 415): 0.147857,
(Timestamp('2021-02-05 00:00:00'), 278, 415): 0.107894,
(Timestamp('2021-02-05 00:00:00'), 277, 403): 0.1245,
(Timestamp('2021-02-05 00:00:00'), 278, 403): 0.070652,
(Timestamp('2021-02-08 00:00:00'), 277, 415): 0.12475,
(Timestamp('2021-02-08 00:00:00'), 278, 415): 0.11235,
(Timestamp('2021-02-08 00:00:00'), 277, 403): 0.3435,
(Timestamp('2021-02-08 00:00:00'), 278, 403): 0.071826,
(Timestamp('2021-02-09 00:00:00'), 277, 415): 0.091,
(Timestamp('2021-02-09 00:00:00'), 278, 415): 0.116125,
(Timestamp('2021-02-09 00:00:00'), 277, 403): 0.057,
(Timestamp('2021-02-09 00:00:00'), 278, 403): 0.075733},
'acnt': {
(Timestamp('2021-02-03 00:00:00'), 277, 403): 5,
(Timestamp('2021-02-03 00:00:00'), 278, 403): 71,
(Timestamp('2021-02-03 00:00:00'), 277, 415): 7,
(Timestamp('2021-02-03 00:00:00'), 278, 415): 95,
(Timestamp('2021-02-04 00:00:00'), 278, 403): 90,
(Timestamp('2021-02-04 00:00:00'), 277, 415): 4,
(Timestamp('2021-02-04 00:00:00'), 278, 415): 115,
(Timestamp('2021-02-05 00:00:00'), 277, 415): 7,
(Timestamp('2021-02-05 00:00:00'), 278, 415): 123,
(Timestamp('2021-02-05 00:00:00'), 277, 403): 2,
(Timestamp('2021-02-05 00:00:00'), 278, 403): 92,
(Timestamp('2021-02-08 00:00:00'), 277, 415): 4,
(Timestamp('2021-02-08 00:00:00'), 278, 415): 60,
(Timestamp('2021-02-08 00:00:00'), 277, 403): 2,
(Timestamp('2021-02-08 00:00:00'), 278, 403): 46,
(Timestamp('2021-02-09 00:00:00'), 277, 415): 6,
(Timestamp('2021-02-09 00:00:00'), 278, 415): 88,
(Timestamp('2021-02-09 00:00:00'), 277, 403): 3,
(Timestamp('2021-02-09 00:00:00'), 278, 403): 60}}
ndf=pd.DataFrame.from_dict(dd,orient='columns')
ndf.index.set_names(['adate','filter_id','tr_id'],inplace=True)
print(ndf)
d = ndf.sort_index(level=0)
d['aavg_r5'] = d['aavg'].groupby(['filter_id','tr_id'],group_keys=False).rolling(5).mean() #weekly mean (trend)
d['aavg_diff'] = d['aavg'].groupby(['filter_id','tr_id'],group_keys=False).diff() #absolute change
d['aavg_pctc'] = d['aavg'].groupby(['filter_id','tr_id'],group_keys=False).pct_change() #relative change
last = d.index.levels[0].max()
l= d.loc[(last,slice(None),slice(None)),:].dropna()
l['s1']=(l['aavg_diff'].abs() > 0.1).astype(int)
l['s2']=(l['aavg_pctc'].abs() > 0.01).astype(int)
l['s3']=(((l['aavg_r5']-l['aavg'])/l['aavg_r5']).abs() > .1).astype(int)
### sum score (not working)
#print(l.loc[:,['s1','s2','s3']].groupby(['filter_id','tr_id']).sum(axis=1))
### sort by score
#l.sort_values(by=['score'],inplace=True)
for ((date,filter,tr),g) in l.loc[:,['s1','s2','s3']].groupby(['adate','filter_id','tr_id']):
    print(g.sum(axis=1))

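The conditions themselves are preliminary; I still need to find good values and criteria (suggestions are welcome :-), but they are not the point of this question).

Thanks for your help :-) Phil.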

Solved with assign:

#[...]
l['s1']=(l['aavg_diff'].abs() > 0.1).astype(int)
l['s2']=(l['aavg_pctc'].abs() > 0.01).astype(int)
l['s3']=(((l['aavg_r5']-l['aavg'])/l['aavg_r5']).abs() > .1).astype(int)
l=l.assign(score=lambda x: x.s1+x.s2+x.s3)
#[...]
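
The same score can probably also be computed with a row-wise sum, since DataFrame.sum (unlike GroupBy.sum) accepts axis=1, and the result can then be sorted and ranked. The following is only a minimal sketch: the small `flags` frame and its values are made up to stand in for `l` from the question, and the `rank` column name is an assumption, not part of the original code.

```python
import pandas as pd

# Hypothetical toy frame standing in for `l` from the question: one date,
# indexed by (adate, filter_id, tr_id), with three 0/1 flag columns.
idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2021-02-09"), 277, 403),
        (pd.Timestamp("2021-02-09"), 277, 415),
        (pd.Timestamp("2021-02-09"), 278, 403),
        (pd.Timestamp("2021-02-09"), 278, 415),
    ],
    names=["adate", "filter_id", "tr_id"],
)
flags = pd.DataFrame(
    {"s1": [1, 0, 0, 0], "s2": [1, 1, 1, 0], "s3": [1, 1, 0, 0]},
    index=idx,
)

# Row-wise sum: DataFrame.sum accepts axis=1, so no groupby is needed here.
scored = flags.assign(score=flags[["s1", "s2", "s3"]].sum(axis=1))

# Sort by score, highest first, and add a dense rank (1 = highest score).
ranked = scored.sort_values("score", ascending=False)
ranked["rank"] = ranked["score"].rank(method="dense", ascending=False).astype(int)
print(ranked)
```

Because every row of `l` is already a unique (adate, filter_id, tr_id) combination, grouping by those levels yields one-row groups, so the row-wise sum should give the same totals without any grouping.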
