Sklearn does not allow transformers to change the number of data points
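I am trying to build a transformer that lets me specify a feature and then filters out any outliers in that feature. An outlier is an observation whose value for that feature deviates from the median by more than 2 times the width of the distribution.
Below is the code I have so far. There are three lines I am not sure are correct. If I am doing something wrong, please tell me what it is and how to fix it. Thanks.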
import numpy as np
from sklearn import base

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, feature):
        self.feature = feature

    def fit(self, X, y=None):
        Q1 = np.percentile(X.loc[:, self.feature], 25)
        Q3 = np.percentile(X.loc[:, self.feature], 75)
        deviation_allowed = 1.5*(Q3 - Q1)
        lower_bound = Q1 - deviation_allowed
        upper_bound = Q3 + deviation_allowed
        # not sure here 1
        self.params_ = [lower_bound, upper_bound]
        # not sure here 2
        return self

    def transform(self, X, y=None):
        X_transformed = X[(X[self.feature] > self.params_[0]) & (X[self.feature] < self.params_[1])]
        # not sure here 3
        return X_transformed
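For reference, the bounds computed in fit can be checked by hand on a small array. This is only a sketch of the 1.5 × IQR rule the code uses; the sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample: eight ordinary values and one obvious outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
deviation_allowed = 1.5 * (q3 - q1)
lower_bound = q1 - deviation_allowed
upper_bound = q3 + deviation_allowed

# Keep only the values strictly inside the bounds.
kept = values[(values > lower_bound) & (values < upper_bound)]

print(q1, q3)                    # 3.0 7.0
print(lower_bound, upper_bound)  # -3.0 13.0
print(kept)                      # the value 100 is dropped
```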
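Sklearn does not allow Transformers to change the number of data points in the output. The reason is that a similar filter would then also have to be applied to the target values (y).
The problem shows up when you plan to put this Transformer in a pipeline that has a classifier/regressor at the end.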
from sklearn import datasets
from sklearn import base, pipeline, linear_model
import numpy as np
import pandas as pd

X, y = datasets.make_classification()

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # Drop the last two rows, so the output has fewer samples than y
        return X[:-2]

pipe = pipeline.make_pipeline(FilterOutliersTransformer(),
                              linear_model.LogisticRegression()).fit(pd.DataFrame(X), y)
You would get
ValueError: Found input variables with inconsistent numbers of samples: [98, 100]
Use imblearn (imbalanced-learn) for a process like this. See here for a similar example.
This should be what you want.
import numpy as np
from sklearn import base

class FilterOutliersTransformer(base.BaseEstimator, base.TransformerMixin):
    def __init__(self, feature):
        # feature is assumed to be a one-element list of column names, e.g. ['a']
        self.feature = feature

    def fit(self, X, y=None):
        Q1 = np.percentile(X.loc[:, self.feature], 25)
        Q3 = np.percentile(X.loc[:, self.feature], 75)
        deviation_allowed = 1.5*(Q3 - Q1)
        lower_bound = Q1 - deviation_allowed
        upper_bound = Q3 + deviation_allowed
        self.params_ = [lower_bound, upper_bound]
        return self

    def transform(self, X, y=None):
        # Indexing with a boolean DataFrame masks instead of filtering:
        # entries that fail the condition become NaN
        X_transformed = X[(X[self.feature] > self.params_[0]) & (X[self.feature] < self.params_[1])]
        # Keep only the rows whose feature value survived the mask
        return X[X_transformed[self.feature[0]].notnull()]
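The transform above returns a row subset of X that keeps the original index, so when it is used on its own, outside a Pipeline, the target can be kept in step by reusing that index. A minimal sketch of the idea, assuming X is a DataFrame and y is a Series with the same index; the data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the last row is an outlier in column "a".
X = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0],
                  "b": range(9)})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1, 1])

# Same 1.5 * IQR bounds as in fit.
q1 = X["a"].quantile(0.25)
q3 = X["a"].quantile(0.75)
deviation_allowed = 1.5 * (q3 - q1)
inside = (X["a"] > q1 - deviation_allowed) & (X["a"] < q3 + deviation_allowed)

X_filtered = X[inside]
y_filtered = y.loc[X_filtered.index]  # reuse the surviving row labels

print(len(X_filtered), len(y_filtered))
```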