Adding cross-validation to a random forest regressor to view feature importances



I have the following code that performs random forest regression to look at feature importances. I'd like to add cross-validation (k-fold). Here is the code I use for the regression; it gives me the features and their rankings. I've tried adapting some code I found online to add cross-validation, but no luck so far. Any ideas? I'm not splitting the data into test/train sets.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv(dataset_path + file_name)
X = df.drop(['target'], axis=1)
y = df['target']
clf = RandomForestRegressor(random_state=42, n_jobs=-1)
# Train model
model = clf.fit(X, y)
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))

You could consider using KFold.split(). This randomly splits the data into k folds (as cross-validation does) and yields the training and test index values for each fold.
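For instance, each iteration of KFold.split() yields a (train indices, test indices) pair of arrays; a minimal standalone demo (the toy array here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(10).reshape(5, 2)  # 5 samples, 2 features
cv = KFold(n_splits=5, shuffle=True, random_state=10)
for train_idx, test_idx in cv.split(X_demo):
    # With 5 samples and 5 splits: 4 training indices, 1 held-out index per fold
    print(train_idx, test_idx)
```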

Your code would then look like this:


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

importances_per_fold = []
CV = KFold(n_splits=5, shuffle=True, random_state=10)
ix_training, ix_test = [], []
# Loop through each fold and append the training & test indices to the empty lists above
for fold in CV.split(X):
    ix_training.append(fold[0])
    ix_test.append(fold[1])

for i, (train_outer_ix, test_outer_ix) in enumerate(zip(ix_training, ix_test)):
    X_train, X_test = X.iloc[train_outer_ix, :], X.iloc[test_outer_ix, :]
    y_train, y_test = y.iloc[train_outer_ix], y.iloc[test_outer_ix]
    clf = RandomForestRegressor(random_state=42, n_jobs=-1)
    # Train model on this fold's training data
    model = clf.fit(X_train, y_train)
    importances_per_fold.append(model.feature_importances_)

# Get mean feature importance across all folds
av_importances = np.mean(importances_per_fold, axis=0)
feat_importances = pd.DataFrame(av_importances, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))

Most of this code was adapted from an implementation of SHAP values with cross-validation. That method of evaluating feature importance is considerably more reliable than using sklearn's built-in feature importances.
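If you don't want to pull in the SHAP library, a similar model-agnostic alternative within sklearn itself is permutation importance computed on each held-out fold. A minimal sketch of that variation, using a synthetic dataset purely for illustration (swap in your own X and y):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import KFold

# Synthetic stand-in data for the sketch
X_arr, y_arr = make_regression(n_samples=200, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
y = pd.Series(y_arr)

cv = KFold(n_splits=5, shuffle=True, random_state=10)
fold_importances = []
for train_idx, test_idx in cv.split(X):
    model = RandomForestRegressor(random_state=42, n_jobs=-1)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # Scoring on the held-out fold avoids the impurity bias of the
    # built-in feature_importances_
    result = permutation_importance(model, X.iloc[test_idx], y.iloc[test_idx],
                                    n_repeats=10, random_state=42, n_jobs=-1)
    fold_importances.append(result.importances_mean)

mean_importances = pd.Series(np.mean(fold_importances, axis=0), index=X.columns)
print(mean_importances.sort_values(ascending=False))
```

Because permutation importance is measured on data the model never saw, features that only look important due to overfitting tend to score lower here than with the impurity-based importances.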
