>我有一个数据框,如下所示。 这是2016年12月至2018年11月期间两种保健品的销售数据。
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 0
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 0
A 50 2017-04-18 6 0
B 200 2017-12-08 25 0
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 0
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 0
从上面,我想用列产品的分组中位数替换 0 工资。
解释:
A : 15, 20, 20, 25, 45
So the median = 20.
B : 60, 80, 90, 100, 100, 110
So the median = 95.
预期输出
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 20
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 95
A 50 2017-04-18 6 20
B 200 2017-12-08 25 95
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 20
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 95
您可以使用pd.Series.mask
屏蔽0
值来尝试此操作,并在此处使用np.nanmedian
。
fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary = df.salary.mask(df.salary.eq(0), fill_vals))
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25
1 A 50 2017-01-03 4 20
2 B 200 2016-12-24 10 100
3 A 50 2017-01-18 3 20
4 B 200 2017-01-28 15 80
5 A 50 2017-01-18 6 15
6 B 200 2017-01-28 20 95
7 A 50 2017-04-18 6 20
8 B 200 2017-12-08 25 95
9 A 50 2017-11-18 6 20
10 B 200 2017-08-21 20 90
11 B 200 2017-12-28 30 110
12 A 50 2018-03-18 10 20
13 B 300 2018-06-08 45 100
14 B 300 2018-09-20 50 60
15 A 50 2018-11-18 8 45
16 B 300 2018-11-28 35 95
或
使用np.where
df['salary'] = (np.where(df['salary']==0,df['salary'].replace(0,np.nan).
groupby(df['product']).transform('median'),df['salary']))
首先使用.groupby
并.transform
列以显示按中位数分组。最后,找到 0 且.loc
的工资,并将其设置为等于工资中位数。
#NOTE - the below line of code uses `median` instead of `np.nanmedian`. These will return different results...
#To anyone reading this, please know which one to use according to your situation...
#As you can see the outputs are different between Chester's answer and mine.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df
输出:
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25.0
1 A 50 2017-01-03 4 20.0
2 B 200 2016-12-24 10 100.0
3 A 50 2017-01-18 3 17.5
4 B 200 2017-01-28 15 80.0
5 A 50 2017-01-18 6 15.0
6 B 200 2017-01-28 20 80.0
7 A 50 2017-04-18 6 17.5
8 B 200 2017-12-08 25 80.0
9 A 50 2017-11-18 6 20.0
10 B 200 2017-08-21 20 90.0
11 B 200 2017-12-28 30 110.0
12 A 50 2018-03-18 10 17.5
13 B 300 2018-06-08 45 100.0
14 B 300 2018-09-20 50 60.0
15 A 50 2018-11-18 8 45.0
16 B 300 2018-11-28 35 80.0