I have sales data in the following form:
ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200
1,2017,1200,800
1,2017,1700,300
1,2017,1900,510
1,2018,2000,560
1,2018,1700,600
2,2015,2000,400
2,2015,1450,580
2,2015,2100,300
2,2017,1400,770
2,2017,2700,330
2,2018,3900,610
2,2018,2000,530
2,2018,1700,700
3,2015,2900,406
3,2015,1450,580
3,2015,2100,300
3,2017,1450,777
3,2018,3100,330
3,2018,3900,610
3,2019,2000,530
3,2019,1900,730
I want to add two new columns that capture the 10th and 90th percentiles of the sales and bonus columns for each ID in each year (I appreciate this doesn't make much sense in this simplified dataset).
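For reference, the sample above can be loaded straight from an inline string for experimenting (a quick sketch using io.StringIO; only the first rows are repeated here, and the actual code below reads the full file sales.csv):

import io
import pandas as pd

# Sketch: paste the sample shown above instead of reading it from disk.
csv_text = """ID,year,sales,bonus
1,2015,1000,500
1,2015,1400,590
1,2016,1100,200"""  # ... plus the remaining rows from the table above
sales = pd.read_csv(io.StringIO(csv_text))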
I used the following approach:
import pandas as pd

def aggs(df, names):
    # For each named column, add the group's 90th and 10th percentiles as new columns.
    for name in names:
        df[name + '90'] = df[name].quantile(.90)
        df[name + '10'] = df[name].quantile(.10)
    return df

sales = pd.read_csv("sales.csv")
names = ['sales', 'bonus']
sales2 = sales.groupby(['year', 'ID'], as_index=False)
sales2 = sales2.apply(aggs, names)
This produces:
ID year sales bonus sales90 sales10 bonus90 bonus10
0 1 2015 1000 500 1360.0 1040.0 581.0 509.0
1 1 2015 1400 590 1360.0 1040.0 581.0 509.0
2 1 2016 1100 200 1100.0 1100.0 200.0 200.0
3 1 2017 1200 800 1860.0 1300.0 742.0 342.0
4 1 2017 1700 300 1860.0 1300.0 742.0 342.0
5 1 2017 1900 510 1860.0 1300.0 742.0 342.0
6 1 2018 2000 560 1970.0 1730.0 596.0 564.0
7 1 2018 1700 600 1970.0 1730.0 596.0 564.0
8 2 2015 2000 400 2080.0 1560.0 544.0 320.0
9 2 2015 1450 580 2080.0 1560.0 544.0 320.0
10 2 2015 2100 300 2080.0 1560.0 544.0 320.0
11 2 2017 1400 770 2570.0 1530.0 726.0 374.0
12 2 2017 2700 330 2570.0 1530.0 726.0 374.0
13 2 2018 3900 610 3520.0 1760.0 682.0 546.0
14 2 2018 2000 530 3520.0 1760.0 682.0 546.0
15 2 2018 1700 700 3520.0 1760.0 682.0 546.0
16 3 2015 2900 406 2740.0 1580.0 545.2 321.2
17 3 2015 1450 580 2740.0 1580.0 545.2 321.2
18 3 2015 2100 300 2740.0 1580.0 545.2 321.2
19 3 2017 1450 777 1450.0 1450.0 777.0 777.0
20 3 2018 3100 330 3820.0 3180.0 582.0 358.0
21 3 2018 3900 610 3820.0 3180.0 582.0 358.0
22 3 2019 2000 530 1990.0 1910.0 710.0 550.0
23 3 2019 1900 730 1990.0 1910.0 710.0 550.0
The problem
The code works as intended, but as the number of rows and columns grows it runs into serious performance problems. I also get this warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
Can anyone suggest how I could improve this approach? Thanks.
- The df.quantile() method can be passed a list of quantiles to compute, and (by default) it does so along the index. Each requested quantile is added as a level of the index, so make your groupby terms (year, ID) the index, then unstack that quantile level so it becomes a column level, giving multi-level columns.
- Flatten the columns from multi-level to a single level and rename them. After the unstack, each column label is a (name, quantile) tuple such as ('sales', 0.9), which is why the rename below unpacks them as for c, n in qs.columns (see the small sketch after this list).
- merge() makes it easy to join on shared columns, so use it together with qs.reset_index() to get year and ID back as columns for the join.
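To make the renaming step concrete, here is a tiny self-contained sketch (toy data, not the sales frame) showing that the unstacked columns are (name, quantile) tuples and how the comprehension flattens them:

import pandas as pd

# Toy example (hypothetical data), just to show the column structure.
toy = pd.DataFrame({'g': ['a', 'a', 'b', 'b'], 'x': [1, 2, 3, 4]})
q = toy.groupby('g').quantile([.9, .1]).unstack()

print(list(q.columns))
# -> (column, quantile) tuples, e.g. ('x', 0.9) and ('x', 0.1)

q.columns = [f'{c}{int(n * 100)}' for c, n in q.columns]
print(list(q.columns))
# -> flat names, e.g. 'x90' and 'x10'

The same idea applied to the real data looks like this: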
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
result = sales.merge(qs.reset_index())
Here are the DataFrames at the intermediate steps:
qs = sales.groupby(['year', 'ID']).quantile([.90, .10]).unstack()
# qs is:
sales bonus
0.9 0.1 0.9 0.1
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
2 2570.0 1530.0 726.0 374.0
3 1450.0 1450.0 777.0 777.0
2018 1 1970.0 1730.0 596.0 564.0
2 3520.0 1760.0 682.0 546.0
3 3820.0 3180.0 582.0 358.0
2019 3 1990.0 1910.0 710.0 550.0
qs.columns = [f'{c}{int(n*100)}' for c, n in qs.columns]
# qs is now
sales90 sales10 bonus90 bonus10
year ID
2015 1 1360.0 1040.0 581.0 509.0
2 2080.0 1560.0 544.0 320.0
3 2740.0 1580.0 545.2 321.2
2016 1 1100.0 1100.0 200.0 200.0
2017 1 1860.0 1300.0 742.0 342.0
...
result = sales.merge(qs.reset_index())
# result:
year ID sales bonus sales90 sales10 bonus90 bonus10
0 2015 1 1000 500 1360.0 1040.0 581.0 509.0
1 2015 1 1400 590 1360.0 1040.0 581.0 509.0
2 2015 2 2000 400 2080.0 1560.0 544.0 320.0
3 2015 2 1450 580 2080.0 1560.0 544.0 320.0
4 2015 2 2100 300 2080.0 1560.0 544.0 320.0
5 2015 3 2900 406 2740.0 1580.0 545.2 321.2
6 2015 3 1450 580 2740.0 1580.0 545.2 321.2
7 2015 3 2100 300 2740.0 1580.0 545.2 321.2
8 2016 1 1100 200 1100.0 1100.0 200.0 200.0
9 2017 1 1200 800 1860.0 1300.0 742.0 342.0
10 2017 1 1700 300 1860.0 1300.0 742.0 342.0
11 2017 1 1900 510 1860.0 1300.0 742.0 342.0
12 2017 2 1400 770 2570.0 1530.0 726.0 374.0
13 2017 2 2700 330 2570.0 1530.0 726.0 374.0
14 2017 3 1450 777 1450.0 1450.0 777.0 777.0
15 2018 1 2000 560 1970.0 1730.0 596.0 564.0
16 2018 1 1700 600 1970.0 1730.0 596.0 564.0
17 2018 2 3900 610 3520.0 1760.0 682.0 546.0
18 2018 2 2000 530 3520.0 1760.0 682.0 546.0
19 2018 2 1700 700 3520.0 1760.0 682.0 546.0
20 2018 3 3100 330 3820.0 3180.0 582.0 358.0
21 2018 3 3900 610 3820.0 3180.0 582.0 358.0
22 2019 3 2000 530 1990.0 1910.0 710.0 550.0
23 2019 3 1900 730 1990.0 1910.0 710.0 550.0
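As a quick sanity check (not part of the original answer), the merge-based result can be compared with the groupby/apply output from the question; this sketch assumes the sales2 and result frames built above are both still in scope:

import pandas as pd

# Hypothetical check: both approaches should yield the same values; only the
# row order and column order differ, so align both before comparing.
cols = ['ID', 'year', 'sales', 'bonus', 'sales90', 'sales10', 'bonus90', 'bonus10']
key = ['ID', 'year', 'sales', 'bonus']
a = sales2.reset_index(drop=True)[cols].sort_values(key).reset_index(drop=True)
b = result[cols].sort_values(key).reset_index(drop=True)
pd.testing.assert_frame_equal(a, b, check_dtype=False)  # raises if they differ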