sklearn StackingClassifier with pipeline



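Setup:

  • I have a dataset that contains some NaNs
  • I want to fit a LogisticRegression and feed its predictions into a HistGradientBoostingClassifier
  • I want the HistGradientBoostingClassifier to use its own internal NaN handling

First, a Debug class to help see what is going on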

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class Debug(BaseEstimator, TransformerMixin):

    def __init__(self, msg='DEBUG'):
        self.msg = msg

    def transform(self, X):
        self.shape = X.shape
        print(self.msg)
        print(f'Shape: {self.shape}')
        print(f'NaN count: {np.count_nonzero(np.isnan(X))}')
        return X

    def fit(self, X, y=None, **fit_params):
        return self

Now my pipeline

# The experimental import is only needed on scikit-learn versions before 1.0
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import StackingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

data = load_breast_cancer()
X = data['data']
y = data['target']
X[0, 0] = np.nan   # make a NaN

lr_pipe = make_pipeline(
    Debug('lr_pipe START'),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression()
)

pipe = StackingClassifier(
    estimators=[('lr_pipe', lr_pipe)],
    final_estimator=HistGradientBoostingClassifier(),
    passthrough=True,
    cv=2,
    verbose=10
)

pipe.fit(X, y)

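What should happen

  • The LogisticRegression is fit on the whole dataset, for later prediction (not used here)
  • To produce the features fed into HGB, the LogisticRegression needs cross_val_predict, and I specified 2 folds. I should see lr_pipe called twice to generate the out-of-fold predictions

What actually happens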


lr_pipe START
Shape: (569, 30)
NaN count: 1
lr_pipe START
Shape: (284, 30)
NaN count: 0
lr_pipe START
Shape: (285, 30)
NaN count: 1
lr_pipe START
Shape: (285, 30)
NaN count: 1
lr_pipe START
Shape: (284, 30)
NaN count: 0
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished

Why is lr_pipe called 5 times? I should see it called 3 times.

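In fact, the fit() function of lr_pipe is called 3 times, but the transform() function is called 5 times. Your Debug class only prints inside transform(), so the log counts transform() calls. You can see the difference by adding a print() to the fit() function.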

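According to the StackingClassifier documentation:

Note that estimators_ are fitted on the full X while final_estimator_ is trained using cross-validated predictions of the base estimators using cross_val_predict.

When your estimator is fitted on the full X, transform() is called once. To fit final_estimator, transform() is called another 2*2 times: in each of the two folds, once on the training set during fit and once on the validation set during prediction. That gives 1 + 4 = 5 calls in total.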
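The same arithmetic can be reproduced without StackingClassifier by running the two steps by hand. This is a rough sketch of the idea, not the library's actual internals: `ShapeRecorder` and `SHAPES` are hypothetical names, and calling `cross_val_predict` with `method="predict_proba"` is an assumption about what the stacker does for a LogisticRegression base estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical global log, shared by the clones that cross_val_predict creates.
SHAPES = []


class ShapeRecorder(BaseEstimator, TransformerMixin):
    """Pass-through transformer that records the shape seen by transform()."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        SHAPES.append(X.shape)
        return X


X, y = load_breast_cancer(return_X_y=True)
X[0, 0] = np.nan  # make a NaN, as in the question

lr_pipe = make_pipeline(
    ShapeRecorder(),
    SimpleImputer(),
    StandardScaler(),
    LogisticRegression(),
)

# Step 1: fit on the full X (Pipeline.fit runs fit_transform on each
# intermediate step, hence one transform() call).
lr_pipe.fit(X, y)
calls_full_fit = len(SHAPES)

# Step 2: out-of-fold predictions with 2 folds. Each fold transforms its
# training part during fit and its validation part during predict_proba.
oof = cross_val_predict(lr_pipe, X, y, cv=2, method="predict_proba")
calls_cv = len(SHAPES) - calls_full_fit

print(f"full fit: {calls_full_fit} transform call(s)")
print(f"cross_val_predict: {calls_cv} transform call(s)")
print(SHAPES)
```

The recorded shapes should line up with the log in the question: one call on all 569 rows, then a train/validation pair for each of the two folds.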
