Grid search with recursive feature elimination in a scikit-learn pipeline returns an error

I am trying to chain grid search and recursive feature elimination in a pipeline using scikit-learn.

GridSearchCV and RFE work fine with a "bare" classifier:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

Putting the classifier in a pipeline returns an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

EDIT:

I realized I was not clear in describing the problem. Here is a clearer snippet:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)
# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)

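As you can see, the only difference is that the estimator is placed in a pipeline. The pipeline, however, hides the "coef_" or "feature_importances_" attributes. The questions are:

Is there a nice way of dealing with this in scikit-learn?
If not, is this behavior intended for any reason?

EDIT 2:

Updated, working snippet based on the answer provided by @Chris: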

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

class MyPipe(pipeline.Pipeline):
    def fit(self, X, y=None, **fit_params):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

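You have an issue with the way you are using the pipeline.

A pipeline works as follows:

When you call .fit(X, y) and so on, the first object is applied to the data. If that object exposes a .transform() method, it is applied and its output is used as the input to the next stage.

A pipeline can have any valid model as its final object, but every object before it must expose a .transform() method.

It works just like a pipe: you feed in data, and each object in the pipeline takes the previous output and applies another transform to it.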
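The chaining described above can be sketched in a few lines of plain Python. This is only a toy model of the idea, not scikit-learn's implementation; `ToyPipeline`, `Scale`, and `MeanModel` are hypothetical names made up for the illustration.

```python
class Scale:
    """Hypothetical transformer: divides every value by the training maximum."""

    def fit(self, X, y=None):
        self.max_ = max(max(row) for row in X)
        return self

    def transform(self, X):
        return [[v / self.max_ for v in row] for row in X]


class MeanModel:
    """Hypothetical final estimator: always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class ToyPipeline:
    """Toy stand-in for a pipeline: transformers first, one estimator last."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        # Every step but the last is fitted, then transforms the data it saw.
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        # The last step only needs fit(); it receives the transformed data.
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


pipe = ToyPipeline([Scale(), MeanModel()]).fit([[1, 2], [3, 4]], [10, 20])
predictions = pipe.predict([[5, 6]])
print(predictions)
```

Note that the wrapper only forwards the methods it defines. Attributes that live on the final step, such as `mean_` here, are not visible on the pipeline object itself, which is the same kind of hiding the question runs into with `coef_`.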

As we can see from

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform

RFE exposes a transform method, so it should be included in the pipeline itself. For example:

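from sklearn.ensemble import RandomForestClassifier
some_sklearn_model = RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
# std_scaler and est are the scaler and final estimator from the question
pipe_params = [('std_scaler', std_scaler), ('RFE', selector), ('clf', est)]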

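Your attempt has a few issues. First, you are trying to scale a partition of the data. Imagine I have two partitions, [1,1] and [10,10]. If I normalize each by its own partition mean, I lose the information that the second partition is significantly above the average. You should scale at the start, not in the middle.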
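To make the arithmetic concrete, here is a small sketch using the two partitions from the paragraph above. It only centers the values by a mean (no division by a standard deviation), which is a simplification chosen for the illustration.

```python
partitions = [[1, 1], [10, 10]]

# Centering each partition by its own mean: both collapse to zeros,
# so the two partitions can no longer be told apart.
per_partition = [[v - sum(p) / len(p) for v in p] for p in partitions]

# Centering by the mean of all the data keeps the gap between them.
all_values = [v for p in partitions for v in p]
global_mean = sum(all_values) / len(all_values)
global_centered = [[v - global_mean for v in p] for p in partitions]

print(per_partition)    # [[0.0, 0.0], [0.0, 0.0]]
print(global_centered)  # [[-4.5, -4.5], [4.5, 4.5]]
```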

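Second, SVR does not implement a transform method, so you cannot incorporate it as a non-final element in a pipeline.

RFE takes in a model, fits it to the data, and then evaluates the weight of each feature.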
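The elimination loop can be sketched roughly as follows. This is a toy illustration, not scikit-learn's RFE: `toy_rfe` and `abs_covariance` are hypothetical helpers, and the absolute covariance with the target is only a stand-in for the `coef_` or `feature_importances_` values a real fitted model would provide.

```python
def abs_covariance(column, y):
    """Stand-in for a model weight: |covariance| between one feature and y."""
    mean_c = sum(column) / len(column)
    mean_y = sum(y) / len(y)
    return abs(sum((c - mean_c) * (t - mean_y) for c, t in zip(column, y)))


def toy_rfe(X, y, n_features_to_select):
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_features_to_select:
        # "Fit" on the remaining features and get one weight per feature.
        weights = {j: abs_covariance([row[j] for row in X], y) for j in remaining}
        # Drop the feature with the smallest weight, then repeat.
        weakest = min(remaining, key=weights.get)
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


X = [[2, 1, 1], [4, 0, 1], [6, 1, 2], [8, 0, 2]]
y = [1, 2, 3, 4]
kept, dropped = toy_rfe(X, y, n_features_to_select=1)
print(kept, dropped)  # [0] [1, 2]
```

The point relevant to the question is the weight lookup inside the loop: if the fitted object offers no per-feature weights, there is nothing to rank, and that is what the RuntimeError is reporting.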

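EDIT:

If you wish, you can include this behavior by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's .coef_ attribute and store it locally in the derived class under the correct name. I suggest you look at the source code on GitHub, since this is only a first start and more error checking would probably be required. Sklearn uses a function decorator called @if_delegate_has_method, which would be a handy thing to add to make sure the method generalizes. I have run this code to make sure it runs, but nothing more.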

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class myPipe(pipeline.Pipeline):
    def fit(self, X,y):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """
        super(myPipe, self).fit(X,y)
        self.coef_=self.steps[-1][-1].coef_
        return self
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]
pipe = myPipe(pipe_params)

selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)
print(clf.best_params_)
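On newer scikit-learn releases the subclass may not be needed. The sketch below assumes scikit-learn 0.24 or later, where RFE accepts an `importance_getter` argument that can be an attribute path into the fitted estimator. It also imports GridSearchCV from `sklearn.model_selection`, which replaced the `sklearn.grid_search` module used above. The step name `svr` is simply the name chosen in this example.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# A plain Pipeline: no subclass and no copied coef_ attribute.
pipe = Pipeline([("std_scaler", StandardScaler()), ("svr", SVR(kernel="linear"))])

# importance_getter is an attribute path resolved on the fitted pipeline,
# here pointing at the coefficients of the final "svr" step.
selector = RFE(pipe, importance_getter="named_steps.svr.coef_")

clf = GridSearchCV(selector, param_grid={"estimator__svr__C": [1, 10]}, cv=5)
clf.fit(X, y)
print(clf.best_params_)
print(clf.best_estimator_.support_)
```

Because StandardScaler keeps the number of columns unchanged, the coefficients of the final step still line up one-to-one with the features RFE is ranking. A pipeline step that drops or adds columns would break that correspondence.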

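If anything is unclear, please ask.

I think you are constructing the pipeline in a slightly different way from what is listed in the pipeline documentation.

Are you looking for this?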

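from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection', selector), ('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print(clf.grid_scores_)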

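Also see this useful example of combining things in a pipeline. For the RFE object, I just used the official documentation to construct it with the SVR estimator, and then put the RFE object into the pipeline in the same way as the scaler and estimator objects.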
