I created a DataFrame with 5 columns and 500 rows. It holds random integer values and was built with the following Python code:
import numpy as np
import pandas as pd

RandomValues = pd.DataFrame(np.random.randint(0, 100, size=(500, 5)),
                            columns=['Name', 'State', 'Age', 'Experience', 'Annual Income'])
The DataFrame looks like this:
Name State Age Experience Annual Income
0 85 10 16 56 89
1 94 1 87 61 37
2 51 7 37 18 92
3 15 1 62 72 60
4 84 88 1 43 14
... ... ... ... ... ...
495 66 33 67 84 7
496 81 2 55 87 59
497 38 50 40 77 36
498 68 45 37 55 90
499 13 82 84 98 35
I am using the standard deviation to find outliers in the 'Annual Income' column:
upper_limit = RandomValues['Annual Income'].mean() + 3 * RandomValues['Annual Income'].std()
lower_limit = RandomValues['Annual Income'].mean() - 3 * RandomValues['Annual Income'].std()
How can I use the any() method to find the outliers in the 'Annual Income' column of the RandomValues DataFrame? Thanks for your help.
I tried using the where() method with the following Python code, but it did not solve the problem:
highOutliers = RandomValues['Annual Income']
print(highOutliers)
print(lowOutliers)
Second, I tried the following, but I got an empty Series for both:
highOutliers = RandomValues.loc[RandomValues['Annual Income'] > upper_limit, 'Annual Income']
lowOutliers = RandomValues.loc[RandomValues['Annual Income'] < lower_limit, 'Annual Income']
print(highOutliers)
print(lowOutliers)
Output:
Series([], Name: Annual Income, dtype: int64)
Series([], Name: Annual Income, dtype: int64)
When you make a comparison like this, you create a boolean Series with the same shape as the Annual Income column, but containing True/False values:
highOutliers_locations = RandomValues['Annual Income'] > upper_limit
lowOutliers_locations = RandomValues['Annual Income'] < lower_limit
This is a useful step toward finding the outliers, but you have not subset the data yet.
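Since the question asks about any(): calling .any() on one of these boolean Series only tells you whether at least one outlier exists; it does not return the outlying rows. A minimal sketch, reusing the names from the question (the fixed seed is an assumption added so the run is repeatable):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed, not in the original code
RandomValues = pd.DataFrame(
    np.random.randint(0, 100, size=(500, 5)),
    columns=['Name', 'State', 'Age', 'Experience', 'Annual Income'],
)

income = RandomValues['Annual Income']
upper_limit = income.mean() + 3 * income.std()
lower_limit = income.mean() - 3 * income.std()

# Boolean Series: True where a value lies beyond a limit
highOutliers_locations = income > upper_limit
lowOutliers_locations = income < lower_limit

# .any() collapses each mask to a single True/False
has_high = bool(highOutliers_locations.any())
has_low = bool(lowOutliers_locations.any())
print(has_high, has_low)
```

So any() is suited to a yes/no check before subsetting, not to extracting the values themselves.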
To actually subset the DataFrame down to just those outliers, use indexing, for example with .loc:
highOutliers = RandomValues.loc[highOutliers_locations, 'Annual Income']
lowOutliers = RandomValues.loc[lowOutliers_locations, 'Annual Income']
Or, in one step:
highOutliers = RandomValues.loc[
RandomValues['Annual Income'] > upper_limit, 'Annual Income'
]
lowOutliers = RandomValues.loc[
RandomValues['Annual Income'] < lower_limit, 'Annual Income'
]
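A likely reason both selections come back empty in the question: np.random.randint(0, 100, ...) draws integers uniformly from 0 to 99, so the column's mean is close to 50 and its standard deviation close to 29. That puts the three-sigma limits at roughly -37 and 136, outside the range the data can take, so nothing qualifies. The sketch below checks this and then plants one hypothetical extreme value (1000, not part of the original data) to show the same .loc selection returning it. The helper three_sigma_limits and the fixed seed are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed so the run is repeatable
RandomValues = pd.DataFrame(
    np.random.randint(0, 100, size=(500, 5)),
    columns=['Name', 'State', 'Age', 'Experience', 'Annual Income'],
)

def three_sigma_limits(column):
    """Hypothetical helper: return (lower, upper) as mean -/+ 3 std."""
    center, spread = column.mean(), column.std()
    return center - 3 * spread, center + 3 * spread

income = RandomValues['Annual Income']
data_max = income.max()
lower_before, upper_before = three_sigma_limits(income)
# The values lie in 0..99 while both limits fall outside that range
print(income.min(), data_max, lower_before, upper_before)

# Hypothetical extreme value, planted only for illustration
RandomValues.loc[0, 'Annual Income'] = 1000
income = RandomValues['Annual Income']
lower_limit, upper_limit = three_sigma_limits(income)

highOutliers = RandomValues.loc[income > upper_limit, 'Annual Income']
lowOutliers = RandomValues.loc[income < lower_limit, 'Annual Income']
print(highOutliers)  # contains the planted value at index 0
print(lowOutliers)   # still empty
```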
For more information and examples, see the pandas guide on indexing and selecting data.