将列中的 0 替换为熊猫中的分组中位数



>我有一个数据框,如下所示。 这是2016年12月至2018年11月期间两种保健品的销售数据。

product     profit      bougt_date      discount    salary
A         50         2016-12-01      5           25
A         50         2017-01-03      4           20
B         200        2016-12-24      10          100
A         50         2017-01-18      3           0
B         200        2017-01-28      15          80
A         50         2017-01-18      6           15
B         200        2017-01-28      20          0
A         50         2017-04-18      6           0
B         200        2017-12-08      25          0
A         50         2017-11-18      6           20
B         200        2017-08-21      20          90
B         200        2017-12-28      30          110
A         50         2018-03-18      10          0
B         300        2018-06-08      45          100
B         300        2018-09-20      50          60
A         50         2018-11-18      8           45
B         300        2018-11-28      35          0

从上面,我想用列产品的分组中位数替换 0 工资。

解释:

A :  15, 20, 20, 25, 45
So the median = 20.

B :  60, 80, 90, 100, 100, 110
So the median = 95.

预期输出

product     profit      bougt_date      discount    salary
A         50         2016-12-01      5           25
A         50         2017-01-03      4           20
B         200        2016-12-24      10          100
A         50         2017-01-18      3           20
B         200        2017-01-28      15          80
A         50         2017-01-18      6           15
B         200        2017-01-28      20          95
A         50         2017-04-18      6           20
B         200        2017-12-08      25          95
A         50         2017-11-18      6           20
B         200        2017-08-21      20          90
B         200        2017-12-28      30          110
A         50         2018-03-18      10          20
B         300        2018-06-08      45          100
B         300        2018-09-20      50          60
A         50         2018-11-18      8           45
B         300        2018-11-28      35          95

您可以使用pd.Series.mask屏蔽0值来尝试此操作,并在此处使用np.nanmedian

fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary = df.salary.mask(df.salary.eq(0), fill_vals))
product  profit  bougt_date  discount  salary
0        A      50  2016-12-01         5      25
1        A      50  2017-01-03         4      20
2        B     200  2016-12-24        10     100
3        A      50  2017-01-18         3      20
4        B     200  2017-01-28        15      80
5        A      50  2017-01-18         6      15
6        B     200  2017-01-28        20      95
7        A      50  2017-04-18         6      20
8        B     200  2017-12-08        25      95
9        A      50  2017-11-18         6      20
10       B     200  2017-08-21        20      90
11       B     200  2017-12-28        30     110
12       A      50  2018-03-18        10      20
13       B     300  2018-06-08        45     100
14       B     300  2018-09-20        50      60
15       A      50  2018-11-18         8      45
16       B     300  2018-11-28        35      95

使用np.where

df['salary'] = (np.where(df['salary']==0,df['salary'].replace(0,np.nan).
groupby(df['product']).transform('median'),df['salary']))

首先使用.groupby.transform列以显示按中位数分组。最后,找到 0 且.loc的工资,并将其设置为等于工资中位数。

#NOTE - the below line of code uses `median` instead of `np.nanmedian`. These will return different results...
#To anyone reading this, please know which one to use according to your situation...
#As you can see the outputs are different between Chester's answer and mine.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df

输出:

product profit bougt_date discount salary
0   A   50  2016-12-01  5   25.0
1   A   50  2017-01-03  4   20.0
2   B   200 2016-12-24  10  100.0
3   A   50  2017-01-18  3   17.5
4   B   200 2017-01-28  15  80.0
5   A   50  2017-01-18  6   15.0
6   B   200 2017-01-28  20  80.0
7   A   50  2017-04-18  6   17.5
8   B   200 2017-12-08  25  80.0
9   A   50  2017-11-18  6   20.0
10  B   200 2017-08-21  20  90.0
11  B   200 2017-12-28  30  110.0
12  A   50  2018-03-18  10  17.5
13  B   300 2018-06-08  45  100.0
14  B   300 2018-09-20  50  60.0
15  A   50  2018-11-18  8   45.0
16  B   300 2018-11-28  35  80.0

最新更新