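I want to run recursive feature elimination with cross validation (RFECV) inside a 10-fold cross-validation in sklearn (i.e. with cross_val_predict or cross_validate). Because RFECV already has a cross-validation part in its own name, I am not sure how to do this. My current code is as follows.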
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
# 10-fold stratified splitter that I want to use for the evaluation
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rfecv = RFECV(estimator=clf, step=1, cv=k_fold)
Please tell me how to use the data X and y together with rfecv in a 10-fold cross validation.
I am happy to provide more details if needed.
To combine recursive feature elimination with a predefined k_fold, you should use RFE rather than RFECV:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(random_state = 0, class_weight="balanced")
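# n_features_to_select has to be passed by keyword in recent scikit-learn releases.
# Note: iris has only 4 features, so asking for 5 means nothing is actually eliminated here.
selector = RFE(clf, n_features_to_select=5, step=1)
cv_acc = []
for train_index, val_index in k_fold.split(X, y):
    # Fit the selector (and the classifier it wraps) on the training fold only
    selector.fit(X[train_index], y[train_index])
    # Predict on the held-out fold using the selected features
    pred = selector.predict(X[val_index])
    acc = accuracy_score(y[val_index], pred)
    cv_acc.append(acc)
cv_acc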
# result:
[1.0,
0.9333333333333333,
0.9333333333333333,
1.0,
0.9333333333333333,
0.9333333333333333,
0.8666666666666667,
1.0,
0.8666666666666667,
0.9333333333333333]
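If the goal is really to keep RFECV and still evaluate it with an outer 10-fold split, one possible approach is nested cross-validation: RFECV is itself an estimator, so it can be passed to cross_validate. The following is only a sketch under my own assumptions: the names inner_cv and outer_cv, the 3-fold inner split, and n_estimators=20 are illustrative choices made to keep the run short, not something taken from the question.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = datasets.load_iris(return_X_y=True)

# Assumption: a small forest to keep the nested loops fast
clf = RandomForestClassifier(n_estimators=20, random_state=0, class_weight="balanced")

# Inner CV: used by RFECV to choose how many features to keep (assumed 3-fold)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer CV: the 10-fold split used to score the whole selection + fit procedure
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

rfecv = RFECV(estimator=clf, step=1, cv=inner_cv)

results = cross_validate(
    rfecv, X, y, cv=outer_cv, scoring="accuracy", return_estimator=True
)

outer_scores = results["test_score"]
# Number of features RFECV decided to keep in each outer fold
n_selected = [est.n_features_ for est in results["estimator"]]

print(outer_scores.mean())
print(n_selected)
```

Each outer fold refits RFECV from scratch, so the number of selected features may differ from fold to fold; the outer scores then estimate how the whole procedure generalizes rather than one fixed feature subset.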
To perform feature selection with RFE and then fit rf using 10-fold cross-validation, here is how you can do it:
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# X and y are the iris arrays loaded in the question
rf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfe = RFE(estimator=rf, step=1)
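Now transform the original X by fitting RFE (with n_features_to_select left at its default, RFE keeps half of the features):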
X_new = rfe.fit_transform(X, y)
Here are the ranked features (not much of a problem with only 4 of them):
rfe.ranking_
# array([2, 3, 1, 1])
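To make the ranking easier to read, it can be paired with the feature names. This is a small sketch, assuming the same iris data and the same RFE settings as above; support_ is the boolean mask of the features that RFE kept (rank 1).

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

iris = datasets.load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(random_state=0, class_weight="balanced")
# Default n_features_to_select keeps half of the features (2 of 4 here)
rfe = RFE(estimator=rf, step=1).fit(X, y)

# Pair each feature name with its rank and whether it was kept
for name, rank, kept in zip(iris.feature_names, rfe.ranking_, rfe.support_):
    print(f"{name}: rank={rank}, kept={kept}")

# Reduce X to the selected columns only
X_new = rfe.transform(X)
print(X_new.shape)
```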
Now split into training and test data, and use GridSearchCV to do the cross-validation together with a grid search (the two usually go together):
X_train, X_test, y_train, y_test = train_test_split(X_new,y,train_size=0.7)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {
    'n_estimators': [5, 10, 15, 20],
    'max_depth': [2, 5, 7, 9]
}
# The splitter object can be passed directly; GridSearchCV calls split() itself
grid_clf = GridSearchCV(rf, param_grid, cv=k_fold)
grid_clf.fit(X_train, y_train)
y_pred = grid_clf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[17, 0, 0],
[ 0, 11, 0],
[ 0, 3, 14]], dtype=int64)
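One caveat with the steps above is that RFE is fitted on all of X before the split, so the test rows take part in the feature selection. A possible alternative, sketched here under my own assumptions, is to wrap the selector and the classifier in a Pipeline so that the selection is refitted inside every training fold. The step names "select" and "model", n_features_to_select=2, and n_estimators=20 are illustrative choices, not part of the original answer.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = datasets.load_iris(return_X_y=True)

pipe = Pipeline([
    # Feature selection is refitted on the training part of every fold
    ("select", RFE(RandomForestClassifier(n_estimators=20, random_state=0),
                   n_features_to_select=2, step=1)),
    # Final classifier trained on the selected features only
    ("model", RandomForestClassifier(n_estimators=20, random_state=0,
                                     class_weight="balanced")),
])

k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=k_fold, scoring="accuracy")

print(scores)
print(scores.mean())
```

The same pipe object could also be handed to GridSearchCV, with parameter names prefixed by the step name (for example model__max_depth), if a grid search over the classifier is still wanted.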