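I am creating a new column in a DataFrame based on a conditional function. The column I am mapping from contains several NaN values. If a NaN appears in the original column, I want it to appear in the new column as well. As an example, my starting point is: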
   Original
0         1
1         2
2         3
3         4
4         5
5         6
6       NaN
7         8
8         9
9        10
Here is the code I originally ran, which (obviously) gives the following result:
def get_value(range):
    if range < 2:
        return 'Below 2'
    elif range < 8:
        return 'Between 2 and 8'
    else:
        return 'Above 8'

df_sample['new_col'] = df_sample.apply(lambda x: get_value(x['Original']), axis=1)
   Original          new_col
0       1.0          Below 2
1       2.0  Between 2 and 8
2       3.0  Between 2 and 8
3       4.0  Between 2 and 8
4       5.0  Between 2 and 8
5       6.0  Between 2 and 8
6       NaN          Above 8
7       8.0          Above 8
8       9.0          Above 8
9      10.0          Above 8
Here, index 6 should show NaN.
I tried including elif range==np.Nan: in my function, but it did not work.
Then, following a suggestion on Stack Overflow, I tried the following:
df_sample['new_col'] = df_sample.apply(lambda x: get_value(x) if(np.all(pd.notnull(x['Original']))) else x, axis = 1)
But this raised an error at the first NaN index in my DataFrame.
Déjà vu here, but building on my previous solution, just add default for the rows where no condition is met:
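import numpy as np

# Conditions are checked in order; the first one that is True wins.
condlist = [
    df['Original'].lt(2),
    df['Original'].lt(8),
    df['Original'].ge(8)]

choicelist = ['Below 2', 'Between 2 and 8', 'Above 8']

# default is used for rows where no condition matches, such as NaN rows.
df['new_col'] = np.select(condlist, choicelist, default=np.nan)

print(df)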
[out]
   Original          new_col
0       1.0          Below 2
1       2.0  Between 2 and 8
2       3.0  Between 2 and 8
3       4.0  Between 2 and 8
4       5.0  Between 2 and 8
5       6.0  Between 2 and 8
6       NaN              nan
7       8.0          Above 8
8       9.0          Above 8
9      10.0          Above 8
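One caveat about the output above: the lowercase nan at index 6 is the string 'nan', not a missing value, because np.select casts the default to the string dtype of the choices. Some newer NumPy releases may instead reject the mix of string choices and a float default with a TypeError. The following is a minimal sketch of one way around both issues, not necessarily what the answer intended. It assumes a sample frame rebuilt from the question, uses a string default, and then masks the rows with Series.where:

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question; index 6 is a real NaN.
df = pd.DataFrame({"Original": [1, 2, 3, 4, 5, 6, np.nan, 8, 9, 10]})

condlist = [
    df["Original"].lt(2),
    df["Original"].lt(8),
    df["Original"].ge(8),
]
choicelist = ["Below 2", "Between 2 and 8", "Above 8"]

# A string default keeps every candidate value in one dtype.
labels = np.select(condlist, choicelist, default="")

# where() keeps labels for non-missing rows and puts NaN everywhere else.
df["new_col"] = pd.Series(labels, index=df.index).where(df["Original"].notna())

print(df)
```

With this variant, df['new_col'].isna() is True at index 6, so the missing value behaves like one in later operations.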
A comment on your code:
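When you end with an else statement, everything that does not test as below 8 is labeled 'Above 8', whatever the value actually is. NaN lands there because every comparison with NaN evaluates to False.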
To keep your code simple, you can do the following:
def get_value(range):
    if range < 2:
        return 'Below 2'
    elif range < 8:
        return 'Between 2 and 8'
    elif range >= 8:
        return 'Above 8'
    else:
        # NaN fails every comparison above, so it falls through to here.
        return np.nan
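The explicit elif range >= 8 branch works for the same reason an equality check against np.nan does not: NaN compares as False against everything, including itself. The sketch below shows this on sample data rebuilt from the question. The parameter is renamed to value (my choice, to avoid shadowing the built-in range), and Series.apply is used instead of a row-wise lambda:

```python
import numpy as np
import pandas as pd


def get_value(value):
    # Every ordered comparison with NaN is False, so NaN skips all three tests.
    if value < 2:
        return "Below 2"
    elif value < 8:
        return "Between 2 and 8"
    elif value >= 8:
        return "Above 8"
    else:
        return np.nan


# NaN is not equal to itself, so an == check never fires; pd.isna does.
print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

df_sample = pd.DataFrame({"Original": [1, 2, 3, 4, 5, 6, np.nan, 8, 9, 10]})
df_sample["new_col"] = df_sample["Original"].apply(get_value)
print(df_sample)
```

If an explicit test is preferred, checking pd.isna(value) first and returning np.nan early should give the same result.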
In general, avoid apply. In this case, cut is a better option:
pd.cut(df.Original, [-np.inf, 2, 8, np.inf],
       labels=['below 2', 'between 2 and 8', 'above 8'],
       right=False)
Output:
0            below 2
1    between 2 and 8
2    between 2 and 8
3    between 2 and 8
4    between 2 and 8
5    between 2 and 8
6                NaN
7            above 8
8            above 8
9            above 8
Name: Original, dtype: category
Categories (3, object): [below 2 < between 2 and 8 < above 8]
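For completeness, here is a small sketch of the same cut call assigned back to the frame, using sample data rebuilt from the question. With right=False each bin is closed on the left, so the bins are [-inf, 2), [2, 8) and [8, inf), which matches the boundaries of the original function. Missing inputs fall into no bin and stay NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Original": [1, 2, 3, 4, 5, 6, np.nan, 8, 9, 10]})

# right=False makes each bin closed on the left: [-inf, 2), [2, 8), [8, inf).
df["new_col"] = pd.cut(
    df["Original"],
    bins=[-np.inf, 2, 8, np.inf],
    labels=["below 2", "between 2 and 8", "above 8"],
    right=False,
)

print(df)
print(df["new_col"].isna().sum())  # number of rows left unlabeled
```

The resulting column has a categorical dtype. If plain strings are needed later, converting with astype(object) is one option.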