How to write a user-defined function for removing outliers in pandas



Suppose I have a DataFrame:

import pandas as pd
data = pd.DataFrame()
data["name"] = ["A","B","C","D","E","F","G","H","I","J"]
data["age"] = [22,9,505,39,50,17,26,33,-43,48]
data["marks"] = [422,59,75,3,50,47,2,83,63,48]
data

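Now I want to remove all outliers from the numeric variables. I can do this with the 1.5 × IQR rule, which treats any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as an outlier.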

Q1 = data.age.quantile(0.25)
Q3 = data.age.quantile(0.75)
IQR = Q3 - Q1
d=data.loc[~((data.age < (Q1 - 1.5 * IQR)) | (data.age > (Q3 + 1.5 * IQR))),]
d

I want to create a user-defined function so that I can pass in the name of a variable and have its outliers removed automatically. This is my attempt:

def outlier (data,age):
    Q1 = data.age.quantile(0.25)
    Q3 = data.age.quantile(0.75)
    IQR = Q3 - Q1
    data.loc[~((data.age < (Q1 - 1.5 * IQR)) | (data.age > (Q3 + 1.5 * IQR))),]
    return data

outlier(data,marks)

However, I get an error saying that marks is not defined. Please help me solve this.

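As the error says, marks is not defined as a variable in your code. You need to pass the column name as a str.

For example: outlier(data, "marks")

You also need to change the function so that it works on whichever column is passed in. data.age always reads the column literally named age, regardless of the argument, and the filtered result has to be assigned before it is returned: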

def outlier(data, col):
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    data = data.loc[~((data[col] < (Q1 - 1.5 * IQR)) | (data[col] > (Q3 + 1.5 * IQR)))]
    return data
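
As a usage sketch, the same idea can be extended to filter on several columns at once. The name remove_iqr_outliers and the parameters cols and k are hypothetical and not part of the original answer. The sketch also assumes that the fences for every column should be computed on the unfiltered data, so that the order of the columns does not matter.

```python
import pandas as pd

data = pd.DataFrame({
    "name": list("ABCDEFGHIJ"),
    "age": [22, 9, 505, 39, 50, 17, 26, 33, -43, 48],
    "marks": [422, 59, 75, 3, 50, 47, 2, 83, 63, 48],
})


def remove_iqr_outliers(df, cols, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR] for every column in cols."""
    keep = pd.Series(True, index=df.index)
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends, matching the strict < and > used for exclusion above
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    # Return a filtered copy; the original DataFrame is left unchanged
    return df[keep]


by_age = remove_iqr_outliers(data, ["age"])
by_both = remove_iqr_outliers(data, ["age", "marks"])
print(by_age)
print(by_both)
```

Building one boolean mask and indexing once keeps the original DataFrame untouched, so the caller decides whether to overwrite data with the result.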

You can also do this by computing z-scores:

import pandas as pd


def zscore(x):
    """Calculate Z Score."""
    return (x - x.mean()) / x.std()


def remove_outliers(data: pd.DataFrame, column):
    """Remove outliers."""
    # calculate z-score and set nans to 0
    zscores = zscore(data[column]).fillna(0)
    # keep rows with |z| < 2; a boolean mask with .loc works for any index, not only the default 0..n-1
    return data.loc[(-2 < zscores) & (zscores < 2)]


data = pd.DataFrame()
data["name"] = ["A","B","C","D","E","F","G","H","I","J"]
data["age"] = [22,9,505,39,50,17,26,33,-43,48]
data["marks"] = [422,59,75,3,50,47,2,83,63,48]
print(remove_outliers(data, "age"))

Output:

  name  age  marks
0    A   22    422
1    B    9     59
3    D   39      3
4    E   50     50
5    F   17     47
6    G   26      2
7    H   33     83
8    I  -43     63
9    J   48     48
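
Note that the row with age -43 survives the z-score filter, although the 1.5 × IQR rule from the question would drop it. The following sketch, which is not part of the original answer and reuses only the age values from the question, computes both criteria side by side to show where they disagree:

```python
import pandas as pd

age = pd.Series([22, 9, 505, 39, 50, 17, 26, 33, -43, 48], name="age")

# z-score: distance from the mean in units of the sample standard deviation
z = (age - age.mean()) / age.std()

# IQR fences, computed as in the question
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flagged_by_z = age[z.abs() >= 2].tolist()
flagged_by_iqr = age[(age < lower) | (age > upper)].tolist()

print(z.round(2))
print("IQR fences:", lower, upper)
print("flagged by z-score:", flagged_by_z)
print("flagged by IQR:", flagged_by_iqr)
```

On this data the extreme value 505 inflates both the mean and the standard deviation, so -43 ends up less than one standard deviation from the mean and passes the z-score test. Quantile-based fences are less sensitive to a single extreme value, which is why the two methods can return different rows.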
