Suppose I have a DataFrame:
import pandas as pd
data = pd.DataFrame()
data["name"] = ["A","B","C","D","E","F","G","H","I","J"]
data["age"] = [22,9,505,39,50,17,26,33,-43,48]
data["marks"] = [422,59,75,3,50,47,2,83,63,48]
data
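Now I want to remove all outliers from the numeric variables. I can do that with the 1.5 * IQR rule, keeping only values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR: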
Q1 = data.age.quantile(0.25)
Q3 = data.age.quantile(0.75)
IQR = Q3 - Q1
d=data.loc[~((data.age < (Q1 - 1.5 * IQR)) | (data.age > (Q3 + 1.5 * IQR))),]
d
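To make the rule concrete, here is a minimal sketch that computes the fences for age explicitly. The names lower, upper and kept are illustrative and not part of the original code. It assumes pandas' default linear interpolation for quantile, and it uses Series.between, which is inclusive on both ends by default and so keeps the same rows as the negated condition above.

```python
import pandas as pd

data = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "age": [22, 9, 505, 39, 50, 17, 26, 33, -43, 48],
})

q1 = data["age"].quantile(0.25)   # 18.25
q3 = data["age"].quantile(0.75)   # 45.75
iqr = q3 - q1                     # 27.5

lower = q1 - 1.5 * iqr            # -23.0
upper = q3 + 1.5 * iqr            # 87.0

# keep rows whose age lies inside [lower, upper]
kept = data[data["age"].between(lower, upper)]
print(kept["age"].tolist())       # [22, 9, 39, 50, 17, 26, 33, 48]
```

Both 505 and -43 fall outside the fences, so two of the ten rows are dropped.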
I would like to write a function so that I can pass in the name of a variable and have its outliers removed automatically. This is my attempt:
def outlier (data,age):
    Q1 = data.age.quantile(0.25)
    Q3 = data.age.quantile(0.75)
    IQR = Q3 - Q1
    data.loc[~((data.age < (Q1 - 1.5 * IQR)) | (data.age > (Q3 + 1.5 * IQR))),]
    return data
outlier(data,marks)
However, I get an error saying that marks is not defined. Please help me fix this.
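As the error says, marks is not defined in your code: a bare marks is read as a Python variable, not as a column name. You need to pass marks as a str, e.g. outlier(data, "marks"). You also need to change the function so that it works with whichever column is passed in, and assign the filtered result before returning it: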
def outlier(data, col):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    data = data.loc[~((data[col] < (Q1 - 1.5 * IQR)) | (data[col] > (Q3 + 1.5 * IQR)))]
    return data
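The reason bracket access matters: data.age always looks up a column literally named "age", so a parameter called age is never used, whereas data[col] looks up whatever string col holds. A short usage sketch follows, repeating the function so it runs on its own. The loop over several columns and the names by_marks and cleaned are my own illustration, not part of the answer; note that filtering column by column recomputes the quartiles on the rows that survived the previous step.

```python
import pandas as pd

def outlier(data, col):
    """Drop rows whose value in `col` lies outside the 1.5 * IQR fences."""
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1
    keep = data[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return data.loc[keep]

data = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "age": [22, 9, 505, 39, 50, 17, 26, 33, -43, 48],
    "marks": [422, 59, 75, 3, 50, 47, 2, 83, 63, 48],
})

# the column name is passed as a string
by_marks = outlier(data, "marks")
print(by_marks["name"].tolist())   # ['B', 'C', 'E', 'F', 'H', 'I', 'J']

# hypothetical extension: filter on several columns, one after another
cleaned = data
for col in ["age", "marks"]:
    cleaned = outlier(cleaned, col)
print(cleaned)
```

The function returns a new DataFrame and leaves data unchanged, so the result has to be assigned to a name.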
You can also do this by computing z-scores:
import pandas as pd
def zscore(x):
    """Calculate the z-score of each value in a Series."""
    return (x - x.mean()) / x.std()
def remove_outliers(data: pd.DataFrame, column):
    """Keep only rows whose z-score in `column` is strictly between -2 and 2."""
    # calculate z-scores and set NaNs to 0 so that those rows are kept
    zscores = zscore(data[column]).fillna(0)
    # select with a boolean mask; passing index labels to .iloc only works for a default RangeIndex
    return data.loc[(-2 < zscores) & (zscores < 2)]
data = pd.DataFrame()
data["name"] = ["A","B","C","D","E","F","G","H","I","J"]
data["age"] = [22,9,505,39,50,17,26,33,-43,48]
data["marks"] = [422,59,75,3,50,47,2,83,63,48]
print(remove_outliers(data, "age"))
Output:
  name  age  marks
0    A   22    422
1    B    9     59
3    D   39      3
4    E   50     50
5    F   17     47
6    G   26      2
7    H   33     83
8    I  -43     63
9    J   48     48
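Note that the row with age -43 survives here, while the 1.5 * IQR rule from the question would drop it. The sketch below compares the two rules on the age column; the variable names are illustrative, and the threshold of 2 is simply the one used in the answer above. The single extreme value 505 inflates both the mean and the standard deviation, so -43 ends up with a z-score of only about -0.73.

```python
import pandas as pd

age = pd.Series([22, 9, 505, 39, 50, 17, 26, 33, -43, 48], name="age")

# z-score rule: keep values with |z| < 2 (sample standard deviation, ddof=1)
z = (age - age.mean()) / age.std()
kept_z = age[z.abs() < 2]

# 1.5 * IQR rule: keep values inside the fences
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
kept_iqr = age[age.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(sorted(set(kept_z) - set(kept_iqr)))   # [-43]
```

Which rule is appropriate depends on the data; the quartile-based fences are less affected by the extreme values they are meant to detect.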