Recursive feature elimination and grid search using scikit-learn



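I want to perform recursive feature elimination with a nested grid search, cross-validating each feature subset, using scikit-learn. From the RFECV documentation it sounds as if this kind of operation is supported through the estimator_params parameter: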

estimator_params : dict
    Parameters for the external estimator. Useful for doing grid searches.

However, when I try to pass a grid of hyperparameters to the RFECV object

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=5, estimator_params={'C': [0.1, 10, 100, 1000]})
selector = selector.fit(X, y)

I get an error like this:

  File "U:/My Documents/Code/ModelFeatures/bin/model_rcc_gene_features.py", line 130, in <module>
    selector = selector.fit(X, y)
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 146, in fit
    estimator.fit(X[:, features], y)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 178, in fit
    fit(X, y, sample_weight, solver_type, kernel, random_seed=seed)
  File "C:\Python27\lib\site-packages\sklearn\svm\base.py", line 233, in _dense_fit
    max_iter=self.max_iter, random_seed=random_seed)
  File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn\svm\libsvm.c:1628)
TypeError: a float is required

If anyone could point out what I am doing wrong, I would greatly appreciate it. Thanks!

EDIT:

After Andreas' answer things became clearer. Below is a working example of RFECV combined with grid search.

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
param_grid = [{'C': 0.01}, {'C': 0.1}, {'C': 1.0}, {'C': 10.0}, {'C': 100.0}, {'C': 1000.0}, {'C': 10000.0}]
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)
clf = GridSearchCV(selector, {'estimator_params': param_grid}, cv=7)
clf.fit(X, y)
clf.best_estimator_.estimator_
clf.best_estimator_.grid_scores_
clf.best_estimator_.ranking_

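Unfortunately, RFECV is limited to cross-validating the number of features (components) to keep. You cannot use it to search over the parameters of the SVM. The error occurs because SVR expects a float for C, and you gave it a list.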
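To illustrate the cause of the error, here is a minimal sketch that skips RFECV entirely and passes the list straight to the estimator as C. This assumes that handing the list to the estimator is effectively what estimator_params did; the exact exception type and message depend on the scikit-learn version, so the sketch catches both TypeError and ValueError.

```python
from sklearn.datasets import make_friedman1
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Assumption: this mimics what the estimator received, i.e. C set to the
# whole list rather than to one value at a time.
bad = SVR(kernel="linear", C=[0.1, 10, 100, 1000])

try:
    bad.fit(X, y)
    failed = False
except (TypeError, ValueError) as exc:
    # Older releases fail inside libsvm; newer ones may reject the
    # parameter earlier, during validation.
    failed = True
    print(type(exc).__name__, exc)
```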

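You can do one of two things. Either run GridSearchCV on RFECV, which splits the data into folds twice (once inside GridSearchCV and once inside RFECV) but keeps the search over the number of features efficient. Or run GridSearchCV on plain RFE, which splits the data only once but scans the parameters of the RFE estimator very inefficiently.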
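The second option (GridSearchCV on plain RFE) might look roughly like the sketch below. The grid values are illustrative assumptions, not recommendations, and the import path assumes a scikit-learn version that has sklearn.model_selection (0.18 or later). Here the number of features is simply one more hyperparameter, so every (n_features_to_select, C) pair triggers a full elimination run.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Plain RFE has no internal cross-validation; GridSearchCV does the only split.
selector = RFE(SVR(kernel="linear"), step=1)

# Illustrative grid (assumed values): feature count and C are searched jointly.
param_grid = {
    "n_features_to_select": [2, 4, 6, 8, 10],
    "estimator__C": [0.1, 1.0, 10.0],
}

clf = GridSearchCV(selector, param_grid, cv=5)
clf.fit(X, y)

best_params = clf.best_params_
n_selected = int(clf.best_estimator_.support_.sum())
print(best_params, n_selected)
```

The inefficiency mentioned above shows up here as the size of the grid: 5 feature counts times 3 values of C, each refit on every fold.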

If you would like to make the docstring less ambiguous, a pull request would be welcome :)

The code provided by david did not work for me (sklearn 0.18); a small change was needed in how param_grid is specified and used.

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
param_grid = [{'estimator__C': [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}]
estimator = SVR(kernel="linear")
selector = RFECV(estimator, step=1, cv=4)
clf = GridSearchCV(selector, param_grid, cv=7)
clf.fit(X, y)
clf.best_estimator_.estimator_
clf.best_estimator_.grid_scores_
clf.best_estimator_.ranking_
