Why does my cross-validation consistently perform better than my train-test split?



I have the code below (using sklearn), which first runs cross-validation on the training set and finally checks against the test set. However, cross-validation consistently performs better, as shown below. Am I overfitting the training data? If so, which hyperparameters would be best to tune to avoid this?

from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_validate, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Cross-validation on the training set only
rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc']))

This gives me:

0.8536558341101569 0.8641939667622551 0.8392201023654705 0.8514895113569482 0.9264002192260914

Now, retraining the model on the whole training + validation set and testing it with the test set it has never seen before:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

RFC = RandomForestClassifier()
RFC.fit(X_train, y_train)
y_pred = RFC.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
y_pred_proba = RFC.predict_proba(X_test)[::, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(accuracy, precision, recall, f1, auc)

Now it gives me the numbers below, which are clearly worse:

0.7809788654060067 0.5113236034222446 0.5044687189672294 0.5078730317420644 0.7589037004728368

I was able to reproduce your scenario with the Pima Indians Diabetes dataset.

The difference you see in the prediction metrics is not consistent; in some runs you may even notice the opposite, because it depends on which rows land in X_test during the split: some cases are easier to predict and yield better metrics, and vice versa. While cross-validation runs predictions over the entire set in rotation and averages this effect out, a single X_test set is at the mercy of the random split.
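
To see how much luck a single split involves, it helps to look at the spread of the per-fold scores rather than only their mean. A minimal sketch, reusing the rfc, cv, and scoring objects from the question's snippet (the exact numbers will vary from run to run):

from numpy import mean, std

scores = cross_validate(rfc, X_train, y_train, scoring=scoring, cv=cv)
acc = scores['test_accuracy']
# The fold-to-fold spread is a rough proxy for how far any single
# held-out set can land from the average.
print(f"accuracy: mean={mean(acc):.3f} std={std(acc):.3f} "
      f"min={min(acc):.3f} max={max(acc):.3f}")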

To get a better picture of what is happening here, I modified your experiment and split it into two steps:

1. Cross-validation step:

I used the entire X and y sets and ran the rest of your code as you had it:

rfc = RandomForestClassifier()
cv = RepeatedKFold(n_splits=10, n_repeats=5)
# cv = KFold(n_splits=10)
scoring = {'accuracy', 'precision', 'recall', 'f1', 'roc_auc'}
scores = cross_validate(rfc, X, y, scoring=scoring, cv=cv)
print(mean(scores['test_accuracy']),
      mean(scores['test_precision']),
      mean(scores['test_recall']),
      mean(scores['test_f1']),
      mean(scores['test_roc_auc']))

Output:

0.768257006151743 0.6943032069967433 0.593436328663432 0.6357667086829574 0.8221242747913622

2. Classic train-test step:

Next, I ran the plain train-test step, but repeated it 50 times with different train-test splits and averaged the metrics (similar to the cross-validation step).

accuracies = []
precisions = []
recalls = []
f1s = []
aucs = []
for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    RFC = RandomForestClassifier()
    RFC.fit(X_train, y_train)
    y_pred = RFC.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    y_pred_proba = RFC.predict_proba(X_test)[::, 1]
    auc = roc_auc_score(y_test, y_pred_proba)
    accuracies.append(accuracy)
    precisions.append(precision)
    recalls.append(recall)
    f1s.append(f1)
    aucs.append(auc)
print(mean(accuracies),
      mean(precisions),
      mean(recalls),
      mean(f1s),
      mean(aucs))

Output:

0.7606926406926405 0.7001931059992001 0.5778712922956755 0.6306501622080503 0.8207846633339568

As expected, the prediction metrics are similar. However, cross-validation runs faster and, by rotating the folds, uses every data point of the whole set for testing a given number of times.
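
As a side note, the 50-iteration loop above can be written more compactly with sklearn's ShuffleSplit, which draws a fresh random train-test split on each iteration. A minimal sketch of that equivalence, reusing the scoring set defined earlier:

from numpy import mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_validate

# 50 independent 70/30 splits, the splitter-based version of the manual loop above
ss = ShuffleSplit(n_splits=50, test_size=0.3)
scores = cross_validate(RandomForestClassifier(), X, y, scoring=scoring, cv=ss)
print(mean(scores['test_accuracy']), mean(scores['test_roc_auc']))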