数据列包含新增客户的值。然而,它包含负值,这意味着在特定的一周内没有留下客户。
用正值替换负值的最佳方法是什么?
例如,以前值的平均值
这是的样本
Week New customer added
Week 01 1,881
Week 02 7
Week 03 10
Week 04 10
Week 05 (22)
Week 06 (18)
Week 07 22
Week 08 25
Week 09 3
Week 10 (36)
Week 11 25
Week 12 20
Week 13 (7)
Week 14 (24)
Week 15 10
Week 16 29
Week 17 4
Week 18 (55)
Week 19 (3)
Week 20 17
Week 21 (1,875)
Week 22 1,867
Week 23 (13)
Week 24 47
Week 25 40
Week 26 (1,916)
Week 27 1,891
Week 28 0
Week 29 22
目前的策略是
df['New customer added'].mask(df['New customer added'].lt(0)).bfill().fillna(0)
我正在考虑第二种策略,即用之前2-3个正值的平均值代替负数。
第二种策略的预期输出是
Week New customer added
Week 01 1,881
Week 02 7
Week 03 10
Week 04 10
Week 05 9 # Negative no. replaced by mean of prev 3(i.e. 7,10,10) positive no. i.e. 9
Week 06 9 # Negative no. replaced by mean of prev 3(i.e. 7,10,10) positive no. i.e. 9
使用:
#is sample data are ,, so replace to emoty string
df['New customer added'] = df['New customer added'].astype(str).str.replace(',','')
#negatives are in sample with (
m = df['New customer added'].astype(str).str.contains('(')
解决方案只过滤正行,并使用mean
滚动,最后通过原始索引使用DataFrame.reindex
,并向前填充缺失值(对于负行(
new = (df.loc[~m, 'New customer added'].astype(float)
.rolling(3, min_periods=1).mean()
.reindex(df.index, method='ffill'))
最后只替换底片:
df['New customer added'] = df['New customer added'].mask(m, new)
print (df)
Week New customer added
0 Week 01 1881
1 Week 02 7
2 Week 03 10
3 Week 04 10
4 Week 05 9.0
5 Week 06 9.0
6 Week 07 22
7 Week 08 25
8 Week 09 3
9 Week 10 16.666667
10 Week 11 25
11 Week 12 20
12 Week 13 16.0
13 Week 14 16.0
14 Week 15 10
15 Week 16 29
16 Week 17 4
17 Week 18 14.333333
18 Week 19 14.333333
19 Week 20 17
20 Week 21 16.666667
21 Week 22 1867
22 Week 23 629.333333
23 Week 24 47
24 Week 25 40
25 Week 26 651.333333
26 Week 27 1891
27 Week 28 0
28 Week 29 22