Custom transformer to filter outliers



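I am trying to build a transformer that lets me specify a feature and then filters out any outliers in that feature. An outlier is an observation whose value for that feature deviates from the median by more than 2 times the width of the distribution.

Below is the code I have so far. There are three lines that I am not sure are correct. If I have done something wrong, please let me know what it is and how to fix it. Thanks.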

import numpy as np
from sklearn import base

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y=None):
        # Quartiles of the chosen feature
        Q1 = np.percentile(X.loc[:, self.feature], 25)
        Q3 = np.percentile(X.loc[:, self.feature], 75)

        # Anything further than 1.5 * IQR outside the quartiles is an outlier
        deviation_allowed = 1.5 * (Q3 - Q1)
        lower_bound = Q1 - deviation_allowed
        upper_bound = Q3 + deviation_allowed

        # not sure here 1
        self.params_ = [lower_bound, upper_bound]
        # not sure here 2
        return self

    def transform(self, X, y=None):
        # Keep only the rows whose feature value lies strictly inside the bounds
        X_transformed = X[(X[self.feature] > self.params_[0]) & (X[self.feature] < self.params_[1])]

        # not sure here 3
        return X_transformed

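scikit-learn does not allow transformers to change the number of data points in the output. The reason is that there is no way to apply the same filtering to the target values (y).

This becomes a problem when you put the transformer in a pipeline that ends with a classifier or regressor.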

from sklearn import base, datasets, linear_model, pipeline
import numpy as np
import pandas as pd

X, y = datasets.make_classification()  # 100 samples by default

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):

    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Drop the first two rows, so only 98 of the 100 samples remain
        return X[2:]

# Fails: the classifier receives 98 rows of X but 100 values of y
pipe = pipeline.make_pipeline(FilterOutliersTransformer(),
                              linear_model.LogisticRegression()).fit(pd.DataFrame(X), y)

You will likely get:

ValueError: Found input variables with inconsistent numbers of samples: [98, 100]

Use imbalanced-learn (imblearn) for this kind of process. See here for a similar example.

This should be what you want.


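import numpy as np
from sklearn import base

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):

    def __init__(self, feature):
        # As written, `feature` is expected to be a one-element list such as ['col']
        self.feature = feature

    def fit(self, X, y=None):
        # Quartiles of the chosen feature
        Q1 = np.percentile(X.loc[:, self.feature], 25)
        Q3 = np.percentile(X.loc[:, self.feature], 75)

        # Anything further than 1.5 * IQR outside the quartiles is an outlier
        deviation_allowed = 1.5 * (Q3 - Q1)
        lower_bound = Q1 - deviation_allowed
        upper_bound = Q3 + deviation_allowed

        self.params_ = [lower_bound, upper_bound]

        return self

    def transform(self, X, y=None):
        # With a list-valued `feature` the mask is a DataFrame, so this step
        # keeps every row and puts NaN where the value is out of bounds
        X_transformed = X[(X[self.feature] > self.params_[0]) & (X[self.feature] < self.params_[1])]

        # Keep only the rows whose feature value survived the masking
        return X[X_transformed[self.feature[0]].notnull()]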
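If the transformer is used on its own rather than inside a pipeline, one way to keep the targets in step is to reuse the index of the filtered DataFrame. The following is only a sketch under assumptions: `IQRFilter` is a hypothetical compact variant that takes a single column name, and the data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import base


class IQRFilter(base.BaseEstimator, base.TransformerMixin):
    """Hypothetical variant of the transformer above, taking one column name."""

    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y=None):
        q1, q3 = np.percentile(X[self.feature], [25, 75])
        allowed = 1.5 * (q3 - q1)
        self.params_ = [q1 - allowed, q3 + allowed]
        return self

    def transform(self, X, y=None):
        lower, upper = self.params_
        # Boolean Series mask: rows outside the bounds are dropped
        return X[(X[self.feature] > lower) & (X[self.feature] < upper)]


# Made-up data: nine ordinary values and one extreme value in column "a"
X = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100], "b": range(10)})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_filtered = IQRFilter("a").fit(X).transform(X)
# Select the targets that belong to the surviving rows
y_filtered = y.loc[X_filtered.index]
```

This only works because the filtered DataFrame keeps its original index labels; resetting the index before selecting from `y` would break the alignment.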
