Improving the efficiency of groupby when applying a custom function



I have sales data in the following form:

ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200
1,2017,1200,800
1,2017,1700,300
1,2017,1900,510
1,2018,2000,560
1,2018,1700,600
2,2015,2000,400
2,2015,1450,580
2,2015,2100,300
2,2017,1400,770
2,2017,2700,330
2,2018,3900,610
2,2018,2000,530
2,2018,1700,700
3,2015,2900,406
3,2015,1450,580
3,2015,2100,300
3,2017,1450,777
3,2018,3100,330
3,2018,3900,610
3,2019,2000,530
3,2019,1900,730

I want to add new columns capturing the 10th and 90th percentiles of the sales and bonus columns for each ID within each year (I appreciate this doesn't make much sense on this simplified dataset).

I have used the following approach:

import pandas as pd

def aggs(df, names):
    # For each named column, attach the group's 90th and 10th
    # percentiles as new columns on every row of the group.
    for name in names:
        df[name + '90'] = df[name].quantile(.90)
        df[name + '10'] = df[name].quantile(.10)
    return df

sales = pd.read_csv("sales.csv")
names = ['sales', 'bonus']
sales2 = sales.groupby(['year', 'ID'], as_index=False)
sales2 = sales2.apply(aggs, names)

This produces:

ID  year    sales   bonus   sales90 sales10 bonus90 bonus10
0   1   2015    1000    500 1360.0  1040.0  581.0   509.0
1   1   2015    1400    590 1360.0  1040.0  581.0   509.0
2   1   2016    1100    200 1100.0  1100.0  200.0   200.0
3   1   2017    1200    800 1860.0  1300.0  742.0   342.0
4   1   2017    1700    300 1860.0  1300.0  742.0   342.0
5   1   2017    1900    510 1860.0  1300.0  742.0   342.0
6   1   2018    2000    560 1970.0  1730.0  596.0   564.0
7   1   2018    1700    600 1970.0  1730.0  596.0   564.0
8   2   2015    2000    400 2080.0  1560.0  544.0   320.0
9   2   2015    1450    580 2080.0  1560.0  544.0   320.0
10  2   2015    2100    300 2080.0  1560.0  544.0   320.0
11  2   2017    1400    770 2570.0  1530.0  726.0   374.0
12  2   2017    2700    330 2570.0  1530.0  726.0   374.0
13  2   2018    3900    610 3520.0  1760.0  682.0   546.0
14  2   2018    2000    530 3520.0  1760.0  682.0   546.0
15  2   2018    1700    700 3520.0  1760.0  682.0   546.0
16  3   2015    2900    406 2740.0  1580.0  545.2   321.2
17  3   2015    1450    580 2740.0  1580.0  545.2   321.2
18  3   2015    2100    300 2740.0  1580.0  545.2   321.2
19  3   2017    1450    777 1450.0  1450.0  777.0   777.0
20  3   2018    3100    330 3820.0  3180.0  582.0   358.0
21  3   2018    3900    610 3820.0  3180.0  582.0   358.0
22  3   2019    2000    530 1990.0  1910.0  710.0   550.0
23  3   2019    1900    730 1990.0  1910.0  710.0   550.0

The problem

The code works as required, but it runs into serious performance problems as the number of rows and columns grows. I also get the warning:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

Can anyone suggest how I could improve this approach? Thanks.
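
As an aside, the warning itself points at a partial fix: build all the new columns first and attach them with a single pd.concat(axis=1) instead of inserting them one at a time. A minimal sketch of that idea, assuming the same sales.csv as above (the transform-based loop is an editor's illustration, not the code from the question):

import pandas as pd

sales = pd.read_csv("sales.csv")
grouped = sales.groupby(['year', 'ID'])

# Collect all new columns in a dict first, then join them to the frame
# in one concat, avoiding the fragmentation the warning complains about.
new_cols = {}
for name in ['sales', 'bonus']:
    new_cols[name + '90'] = grouped[name].transform('quantile', .90)
    new_cols[name + '10'] = grouped[name].transform('quantile', .10)
sales = pd.concat([sales, pd.DataFrame(new_cols)], axis=1)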

  1. The df.quantile() method can be passed a list of quantiles to compute, and (by default) it computes them along the index (see the short illustration after this list).

    • Each quantile is added as a level on the index. So keep your groupby terms (year, ID) as the index, then unstack the quantile level so it becomes a level of a column MultiIndex.
  2. Flatten the columns from a MultiIndex to a single level and rename them.

  3. merge() joins easily on shared columns, so using it together with qs.reset_index() brings year and ID back as columns.
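
As a quick illustration of step 1 (ignoring the grouping for a moment), quantile() with a list returns one row per requested quantile; the call below is just a toy example to show the shape:

sales[['sales', 'bonus']].quantile([.90, .10])
# -> DataFrame indexed by the quantiles 0.9 and 0.1,
#    with one column per input column

Putting the three steps together: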

qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()  # step 1
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]           # step 2
result = sales.merge(qs.reset_index())                             # step 3

Here are the dataframes at the intermediate steps:

qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
# qs is:
          sales           bonus
            0.9     0.1     0.9    0.1
year ID
2015 1   1360.0  1040.0  581.0  509.0
2   2080.0  1560.0  544.0  320.0
3   2740.0  1580.0  545.2  321.2
2016 1   1100.0  1100.0  200.0  200.0
2017 1   1860.0  1300.0  742.0  342.0
2   2570.0  1530.0  726.0  374.0
3   1450.0  1450.0  777.0  777.0
2018 1   1970.0  1730.0  596.0  564.0
2   3520.0  1760.0  682.0  546.0
3   3820.0  3180.0  582.0  358.0
2019 3   1990.0  1910.0  710.0  550.0
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
# qs is now
         sales90  sales10  bonus90  bonus10
year ID
2015 1    1360.0   1040.0    581.0    509.0
2    2080.0   1560.0    544.0    320.0
3    2740.0   1580.0    545.2    321.2
2016 1    1100.0   1100.0    200.0    200.0
2017 1    1860.0   1300.0    742.0    342.0
...
result = sales.merge(qs.reset_index())
# result:
year  ID  sales  bonus  sales90  sales10  bonus90  bonus10
0   2015   1   1000    500   1360.0   1040.0    581.0    509.0
1   2015   1   1400    590   1360.0   1040.0    581.0    509.0
2   2015   2   2000    400   2080.0   1560.0    544.0    320.0
3   2015   2   1450    580   2080.0   1560.0    544.0    320.0
4   2015   2   2100    300   2080.0   1560.0    544.0    320.0
5   2015   3   2900    406   2740.0   1580.0    545.2    321.2
6   2015   3   1450    580   2740.0   1580.0    545.2    321.2
7   2015   3   2100    300   2740.0   1580.0    545.2    321.2
8   2016   1   1100    200   1100.0   1100.0    200.0    200.0
9   2017   1   1200    800   1860.0   1300.0    742.0    342.0
10  2017   1   1700    300   1860.0   1300.0    742.0    342.0
11  2017   1   1900    510   1860.0   1300.0    742.0    342.0
12  2017   2   1400    770   2570.0   1530.0    726.0    374.0
13  2017   2   2700    330   2570.0   1530.0    726.0    374.0
14  2017   3   1450    777   1450.0   1450.0    777.0    777.0
15  2018   1   2000    560   1970.0   1730.0    596.0    564.0
16  2018   1   1700    600   1970.0   1730.0    596.0    564.0
17  2018   2   3900    610   3520.0   1760.0    682.0    546.0
18  2018   2   2000    530   3520.0   1760.0    682.0    546.0
19  2018   2   1700    700   3520.0   1760.0    682.0    546.0
20  2018   3   3100    330   3820.0   3180.0    582.0    358.0
21  2018   3   3900    610   3820.0   3180.0    582.0    358.0
22  2019   3   2000    530   1990.0   1910.0    710.0    550.0
23  2019   3   1900    730   1990.0   1910.0    710.0    550.0
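
As a quick usage example, the merged percentiles can now be compared row by row, e.g. to flag sales in the top decile of their (year, ID) group ('top_sales' is just an illustrative column name):

# Mark rows whose sales reach their group's 90th percentile.
result['top_sales'] = result['sales'] >= result['sales90']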
