Given some fake data:
import numpy as np
import pandas as pd
X = pd.DataFrame(np.random.randint(1, 10, 28).reshape(14, 2))
y = pd.Series(np.repeat([0, 1], [10, 4]))  # imbalanced: more 0s than 1s
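I wrote an sklearn fit-transformer that undersamples the majority class of y so that it matches the size of the minority class. I would like to use it in a pipeline.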
import random
from sklearn.base import BaseEstimator, TransformerMixin

class UnderSampling(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):  # I don't need fit to do anything
        return self

    def transform(self, X, y):
        is_pos = y == 1
        idx_pos = y[is_pos].index
        random.seed(self.random_state)
        # random.sample needs a sequence, so convert the index to a list
        idx_neg = random.sample(list(y[~is_pos].index), int(is_pos.sum()))
        idx = sorted(list(idx_pos) + idx_neg)
        X_resampled = X.loc[idx]
        y_resampled = y.loc[idx]
        return X_resampled, y_resampled

    def fit_transform(self, X, y):
        return self.transform(X, y)
Unfortunately, I can't use it in a pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
us = UnderSampling()
rfc = RandomForestClassifier()
model = make_pipeline(us, rfc)
model.fit(X, y)
How can I make this pipeline work?
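You should not call estimator methods directly on the class; call them on an instance of the class. The reason is that estimators usually carry some stored state (for example, model coefficients):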
u = UnderSampling()
u.fit(X, y)  # fit returns the estimator itself, so there is nothing to unpack
a, b = u.fit_transform(X, y)
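As for the pipeline itself: a scikit-learn Pipeline passes only X from one step to the next and never lets a transformer change y, so a step that drops rows cannot be written as a regular transformer. One possible workaround is to move the undersampling into the fit of the final estimator. The following is only a sketch under a few assumptions: undersample and UnderSampledClassifier are hypothetical helper names (not part of scikit-learn), the minority label is assumed to be 1, and the majority class is assumed to have at least as many rows as the minority class.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def undersample(X, y, random_state=None):
    """Hypothetical helper: randomly drop rows whose label is not 1
    until both groups have the same number of rows."""
    rng = np.random.RandomState(random_state)
    y_arr = np.asarray(y)
    pos = np.flatnonzero(y_arr == 1)   # positions of the minority label (assumed to be 1)
    neg = np.flatnonzero(y_arr != 1)   # positions of everything else
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    keep = np.sort(np.concatenate([pos, keep_neg]))
    # Earlier pipeline steps may hand over a DataFrame or a plain array.
    X_res = X.iloc[keep] if hasattr(X, "iloc") else np.asarray(X)[keep]
    return X_res, y_arr[keep]


class UnderSampledClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: undersample during fit only, then delegate
    to a clone of the wrapped classifier."""

    def __init__(self, estimator, random_state=None):
        self.estimator = estimator
        self.random_state = random_state

    def fit(self, X, y):
        X_res, y_res = undersample(X, y, self.random_state)
        self.estimator_ = clone(self.estimator).fit(X_res, y_res)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        # No resampling at prediction time: every row gets a prediction.
        return self.estimator_.predict(X)


# Same shape of fake data as in the question, with a fixed seed.
X = pd.DataFrame(np.random.RandomState(0).randint(1, 10, 28).reshape(14, 2))
y = pd.Series(np.repeat([0, 1], [10, 4]))

rfc = RandomForestClassifier(n_estimators=10, random_state=0)
model = make_pipeline(StandardScaler(), UnderSampledClassifier(rfc, random_state=0))
model.fit(X, y)
pred = model.predict(X)
```

Because the resampling happens only inside fit, predict still returns one label per input row. If an extra dependency is acceptable, the imbalanced-learn package also provides its own Pipeline that accepts resampling steps such as RandomUnderSampler, which may be a simpler route.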