I have a raw dataset that looks like this:

ColA ColB  duration  interval  Counter
A    SD           2         4        1
A    SD           3         3        2
A    UD           2         1       10
B    UD           1         2        2
B    UD           2         2        2
B    SD           3         3       13
B    SD           1         4       19
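To make the answers below reproducible, the dataset can be built as a DataFrame roughly as follows. The variable name `df` is assumed because the answers use it, and the `A` values in `ColA` are inferred from the outputs shown in the answers.

```python
import pandas as pd

# Sample data from the question; the "A" entries in ColA are inferred
# from the aggregated outputs shown in the answers.
df = pd.DataFrame({
    "ColA": ["A", "A", "A", "B", "B", "B", "B"],
    "ColB": ["SD", "SD", "UD", "UD", "UD", "SD", "SD"],
    "duration": [2, 3, 2, 1, 2, 3, 1],
    "interval": [4, 3, 1, 2, 2, 3, 4],
    "Counter": [1, 2, 10, 2, 2, 13, 19],
})
print(df)
```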
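Use DataFrame.pivot_table with a helper column new created by copying ColB, then flatten the MultiIndex columns and join the result to a new DataFrame created by aggregating Counter with sum: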
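# Pivot the means, using a copy of ColB as the column labels
df1 = (df.assign(new=df['ColB'])
         .pivot_table(index=['ColA', 'ColB'],
                      columns='new',
                      values=['interval', 'duration'],
                      fill_value=0,
                      aggfunc='mean'))
# Flatten the MultiIndex columns, e.g. ('duration', 'SD') -> 'durationSD'
df1.columns = df1.columns.map(lambda x: f'{x[0]}{x[1]}')
# Sum Counter per (ColA, ColB) group and join the pivoted means
df = (df.groupby(['ColA', 'ColB'])['Counter']
        .sum()
        .to_frame(name='SumCounter')
        .join(df1).reset_index())
print(df)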
  ColA ColB  SumCounter  durationSD  durationUD  intervalSD  intervalUD
0    A   SD           3         2.5         0.0         3.5           0
1    A   UD          10         0.0         2.0         0.0           1
2    B   SD          32         2.0         0.0         3.5           0
3    B   UD           4         0.0         1.5         0.0           2
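The flattening step can be looked at in isolation. This is a minimal sketch on a made-up one-row frame (the `toy` name and its values are purely illustrative), showing how mapping over the column tuples joins the two levels into a single name:

```python
import pandas as pd

# Hypothetical two-level columns, shaped like the pivot_table result
cols = pd.MultiIndex.from_product([["duration", "interval"], ["SD", "UD"]])
toy = pd.DataFrame([[2.5, 0.0, 3.5, 0.0]], columns=cols)

# Each column label is a tuple such as ('duration', 'SD'); join its parts
toy.columns = toy.columns.map(lambda x: f"{x[0]}{x[1]}")
flat_names = list(toy.columns)
print(flat_names)
```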
You can try grouping by ColA and then, within each group, by ColB, using Named Aggregation:
out = df.groupby('ColA').apply(
    lambda g: g.groupby('ColB').agg({
        'duration': [(f'{g["ColB"].iloc[0]}', 'mean')],
        'interval': [(f'{g["ColB"].iloc[0]}', 'mean')],
        'Counter': 'sum'})).fillna(0)
print(out)
          duration interval Counter duration interval
                SD       SD     sum       UD       UD
ColA ColB
A    SD        2.5      3.5       3      0.0      0.0
     UD        2.0      1.0      10      0.0      0.0
B    SD        0.0      0.0      32      2.0      3.5
     UD        0.0      0.0       4      1.5      2.0
Then rename the MultiIndex columns:
out.columns = ['SumCounter' if 'Counter' in col[0] else f'Avg{col[0]}{col[1]}' for col in out.columns.values]
print(out)
           AvgdurationSD  AvgintervalSD  SumCounter  AvgdurationUD  AvgintervalUD
ColA ColB
A    SD              2.5            3.5           3            0.0            0.0
     UD              2.0            1.0          10            0.0            0.0
B    SD              0.0            0.0          32            2.0            3.5
     UD              0.0            0.0           4            1.5            2.0
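In the output above, the A/UD means sit under the SD-labelled columns and the B/SD means under the UD-labelled ones, because the label is taken from the first ColB value of each ColA group. If the labels should instead follow each row's own ColB, a minimal sketch using the keyword form of named aggregation might look like this. The `labels` helper and the loop are illustrative additions rather than part of the answer above, and the sample frame uses inferred `A` values for `ColA`.

```python
import pandas as pd

# Sample data from the question ("A" values in ColA are inferred)
df = pd.DataFrame({
    "ColA": ["A", "A", "A", "B", "B", "B", "B"],
    "ColB": ["SD", "SD", "UD", "UD", "UD", "SD", "SD"],
    "duration": [2, 3, 2, 1, 2, 3, 1],
    "interval": [4, 3, 1, 2, 2, 3, 4],
    "Counter": [1, 2, 10, 2, 2, 13, 19],
})

# Keyword-style named aggregation: output_name=(input_column, function)
agg = df.groupby(["ColA", "ColB"]).agg(
    duration=("duration", "mean"),
    interval=("interval", "mean"),
    SumCounter=("Counter", "sum"),
)

# ColB value of every aggregated row, used to decide which column it fills
labels = agg.index.get_level_values("ColB")

out = agg[["SumCounter"]].copy()
for col in ["duration", "interval"]:
    for label in labels.unique():
        # Keep the mean only on rows whose own ColB equals the label
        out[f"Avg{col}{label}"] = agg[col].where(labels == label, 0.0)
print(out)
```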
Another option, with groupby and unstack:
temp = (df
        .assign(dummy=df.ColB)
        .groupby(['ColA', 'ColB', 'dummy'])
        .agg({'duration': 'mean', 'interval': 'mean', 'Counter': 'sum'})
        .rename(columns={'Counter': 'SumCounter'})
        .set_index('SumCounter', append=True)
        .unstack('dummy', fill_value=0)
        )
temp.columns = temp.columns.map(lambda x: f"Avg{''.join(x)}")
temp.reset_index()
  ColA ColB  SumCounter  AvgdurationSD  AvgdurationUD  AvgintervalSD  AvgintervalUD
0    A   SD           3            2.5            0.0            3.5            0.0
1    A   UD          10            0.0            2.0            0.0            1.0
2    B   SD          32            2.0            0.0            3.5            0.0
3    B   UD           4            0.0            1.5            0.0            2.0