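I want to use scikit-learn's GridSearchCV to run a grid search and compute the cross-validation error using a predefined development/validation split (1-fold cross-validation).
I'm afraid I've done something wrong, because my validation accuracy is suspiciously high. Here is where I think I'm going wrong: I split my training data into development and validation sets, train on the development set, and record the cross-validation score on the validation set. My accuracy may be inflated because I'm really training on a mix of the development and validation sets and then testing on the validation set. I'm not sure I'm using scikit-learn's PredefinedSplit module correctly. Details are below.
Following this answer, I did the following: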
import numpy as np
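from sklearn.metrics import accuracy_score
# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search is the old, removed module.
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from xgboost import XGBClassifier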
# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
data[training_features], data[training_response], test_size=0.2, random_state=550)
# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
this is a 323430 dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0.'''
validation_idx = np.repeat(-1, y_train.shape[0])
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
int(round(.2*validation_idx.shape[0])), replace = False)] = 0
# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())
# sanity check
print(len(validation_split[0][0])) # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs .8
print(validation_idx.shape[0] == y_train.shape[0]) # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([])
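As a small check on the convention used above, here is a minimal sketch on a toy fold array (the values are made up for illustration and are not the real data). It confirms that entries marked -1 only ever land in the training indices, while entries marked 0 form the single validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold labels (illustrative only):
# -1 = always in the training (development) part, 0 = validation fold 0.
test_fold = np.array([-1, -1, 0, -1, 0, -1, -1, 0, -1, -1])

ps = PredefinedSplit(test_fold)
splits = list(ps.split())

# One (train, test) pair, because 0 is the only fold label other than -1.
dev_idx, val_idx = splits[0]
print(ps.get_n_splits())
print(dev_idx)
print(val_idx)
```

So the split itself is not where validation rows could slip into training: the two index arrays are disjoint by construction.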
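Now I run the grid search with GridSearchCV. My intent is that, for each parameter combination on the grid, the model is fit on the development set, and the cross-validation score is recorded when the resulting estimator is applied to the validation set.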
# a vanilla XGboost model
model1 = XGBClassifier()
# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
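# A grid search.
# NOTE: I'm passing the split produced by PredefinedSplit as the `cv` argument.
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           verbose=1)
# Run the search on the training data; fit returns the fitted search object,
# which is the grid_result2 used below.
grid_result2 = grid_search.fit(X_train, y_train)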
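The intent described above (fit on the development rows, score on the validation rows, once per grid point) can be written out by hand as a cross-check. This is only a sketch under assumptions: the synthetic data, the LogisticRegression model, and the small C grid are stand-ins for the real data and the XGBoost grid. It compares the hand-computed scores with what GridSearchCV records in cv_results_.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import GridSearchCV, ParameterGrid, PredefinedSplit

# Synthetic stand-in data (assumption: not the data from the question).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Every fourth row goes to the single validation fold; the rest are development rows.
test_fold = np.where(np.arange(X.shape[0]) % 4 == 0, 0, -1)
cv = PredefinedSplit(test_fold)
dev_idx, val_idx = next(cv.split())

base = LogisticRegression(max_iter=1000)
param_grid = {"C": [0.1, 1.0, 10.0]}
scorer = get_scorer("neg_log_loss")

# By hand: fit on development rows only, score on validation rows only.
manual_scores = []
for params in ParameterGrid(param_grid):
    model = clone(base).set_params(**params).fit(X[dev_idx], y[dev_idx])
    manual_scores.append(scorer(model, X[val_idx], y[val_idx]))

# The same thing through GridSearchCV; refit=False limits it to the search itself.
search = GridSearchCV(base, param_grid, scoring="neg_log_loss", cv=cv, refit=False)
search.fit(X, y)

print(manual_scores)
print(search.cv_results_["mean_test_score"])
```

If the two printed arrays agree, the scores recorded during the search are genuine held-out scores, and any inflation has to come from a later step.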
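Now, here is the red flag. I use the best estimator found by the grid search to compute the accuracy on the validation set. It is very high: 0.89207865689639176. Worse, it is almost identical to the accuracy I get when I apply the classifier to the development set, which I just trained on: 0.89295597192591902. But when I apply the classifier to the real test set, I get a much lower accuracy, roughly 0.78: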
# accuracy score on the validation set. This yields .89207865
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train.iloc[validation_split[0][1]])
# accuracy score when applied to the development set. This yields .8929559
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train.iloc[validation_split[0][0]])
# finally, the score when applied to the test set. This yields .783
accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
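To me, the near-perfect match between the model's accuracy on the development and validation sets, together with the significant drop in accuracy on the test set, clearly indicates that I am accidentally training on the validation data, so my cross-validation score is not representative of the model's true accuracy.
I can't seem to find where I went wrong, mostly because I don't know what GridSearchCV does under the hood when it receives a PredefinedSplit object as the argument to its cv parameter.
Any idea where I went wrong? If you need more details, please let me know. The code is also in this notebook on GitHub.
Thanks!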
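You need to set refit=False (it is not the default). Otherwise, once the search has finished, the grid search refits the estimator on the whole dataset passed to fit, ignoring cv.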
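To see the effect of the default refit=True in isolation, here is a minimal sketch under assumptions: synthetic data and a fully grown DecisionTreeClassifier stand in for the asker's data and XGBoost model, and the grid has a single point to keep it short. A fully grown tree memorizes the rows it is fit on, so predictions on the validation rows look perfect once the refit has seen them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the data from the question).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# -1 = development row, 0 = validation row.
rng = np.random.RandomState(550)
test_fold = np.full(X.shape[0], -1)
test_fold[rng.choice(X.shape[0], size=100, replace=False)] = 0
cv = PredefinedSplit(test_fold)
dev_idx, val_idx = next(cv.split())

search = GridSearchCV(DecisionTreeClassifier(random_state=0), {"max_depth": [None]},
                      scoring="accuracy", cv=cv, refit=True)
search.fit(X, y)

# Held-out score recorded during the search (model fit on development rows only).
honest_acc = search.cv_results_["mean_test_score"][0]

# search.predict uses best_estimator_, which was refit on ALL rows of X,
# so it has already seen the validation rows.
leaked_acc = accuracy_score(y[val_idx], search.predict(X[val_idx]))

print(honest_acc, leaked_acc)
```

The gap between the two numbers is the same kind of gap the question shows between the validation and test accuracies.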
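Yes, the validation data is leaking. You need to set refit=False for GridSearchCV so that it does not refit on the entire data, which includes both the training and the validation data.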
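One way to act on this, sketched under assumptions: the synthetic data, the DecisionTreeClassifier, and the small grid below are stand-ins for the real data and the XGBoost model. With refit=False the search object has no best_estimator_ and cannot predict, so the selected parameters are read from best_params_ and a fresh model is fit on the development rows only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replaces X_train / y_train from the question).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# -1 = development row, 0 = validation row (100 of 500 rows).
rng = np.random.RandomState(550)
test_fold = np.full(X.shape[0], -1)
test_fold[rng.choice(X.shape[0], size=100, replace=False)] = 0
cv = PredefinedSplit(test_fold)
dev_idx, val_idx = next(cv.split())

base = DecisionTreeClassifier(random_state=0)
search = GridSearchCV(base, {"max_depth": [2, 4, 8]},
                      scoring="accuracy", cv=cv, refit=False)
search.fit(X, y)

# No refit happened, so there is no best_estimator_ to call predict on.
print(hasattr(search, "best_estimator_"))

# Fit a fresh model with the selected parameters on the development rows only.
final_model = clone(base).set_params(**search.best_params_)
final_model.fit(X[dev_idx], y[dev_idx])
val_acc = accuracy_score(y[val_idx], final_model.predict(X[val_idx]))

print(search.best_params_, val_acc, search.best_score_)
```

The validation accuracy computed this way should match best_score_, since both come from a model that never saw the validation rows. The untouched test set remains the final check.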