GridSearchCV best score is far from the test-set score

I am training on some data with a grid search, and I noticed that the best score is far from the score on the test set.

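from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# F1 on the positive class (label 1) is the metric the search optimizes
custom_scorer = make_scorer(f1_score, greater_is_better=True, pos_label=1)
rf_params = {
    'max_depth': [20, 50, 100, 150],
    'min_samples_split': [10, 20, 50, 100],
}
rf = RandomForestClassifier(random_state=42)
rf_grid = GridSearchCV(rf, param_grid=rf_params, cv=5, scoring=custom_scorer)
rf_grid.fit(X_train, y_train)
print("Best Score: {}".format(rf_grid.best_score_))
>> Best Score: 0.9616742738181994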

When I run it on the test set, it looks like this:

from sklearn import metrics

y_preds = rf_grid.predict(X_test)
print(metrics.classification_report(y_test, y_preds))
              precision    recall  f1-score   support

           0       0.93      1.00      0.96      2308
           1       0.88      0.07      0.13       192

    accuracy                           0.93      2500
   macro avg       0.90      0.54      0.55      2500
weighted avg       0.92      0.93      0.90      2500

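As you can see, the F1 score for the positive class is 0.13, which is very different from the best_score_ reported by GridSearchCV. I know the two should differ, since they are computed on different data, but a gap this large is confusing.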

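I have tried many variations of this experiment, including upsampling the minority class and widening or narrowing the params. I am not sure what else to try.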

First, I think you should cap max_depth at around 20 (if you have 2 classes and roughly 3k samples); your RandomForest is simply overfitting your data.
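One way to check the overfitting claim is to compare the training-fold F1 with the validation-fold F1 at several depths. The sketch below is only an illustration: the data comes from make_classification as a stand-in for the asker's X_train/y_train, and the depth values and n_estimators=50 are arbitrary choices, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced stand-in for the real training data (an assumption).
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42
)

gaps = {}
for max_depth in (5, 20, 100):  # illustrative depths, not tuned values
    rf = RandomForestClassifier(
        n_estimators=50, max_depth=max_depth, random_state=42
    )
    # return_train_score=True also reports the score on the training folds
    cv = cross_validate(rf, X, y, cv=5, scoring="f1", return_train_score=True)
    train_f1 = cv["train_score"].mean()
    val_f1 = cv["test_score"].mean()
    gaps[max_depth] = train_f1 - val_f1
    print(f"max_depth={max_depth}: train F1={train_f1:.3f}, CV F1={val_f1:.3f}")
```

A large gap between the two numbers at a given depth is the usual sign that the trees are memorizing the training folds.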

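Second, with a class imbalance this large, you should set the class_weight parameter. Note that it is passed to the RandomForestClassifier constructor rather than to fit(), which accepts sample_weight instead (see sklearn's compute_class_weight() utility).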
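A minimal sketch of the class-weight suggestion follows. The label array is made up: it only reuses the class counts from the test-set report (2308 vs. 192) as an assumption about how imbalanced the training labels are. In scikit-learn, class_weight is a constructor argument of RandomForestClassifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with the class counts seen in the test report.
y_example = np.array([0] * 2308 + [1] * 192)

classes = np.unique(y_example)
# "balanced" weights are n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(
    class_weight="balanced", classes=classes, y=y_example
)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)

# Either pass the explicit dict ...
rf_explicit = RandomForestClassifier(class_weight=class_weight, random_state=42)
# ... or let the estimator compute the same weights from y at fit time.
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
```

The explicit dict is handy when you want to scale the minority weight up or down by hand; otherwise class_weight="balanced" is the shorter route.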

Also, you could try a somewhat wider parameter grid that includes n_estimators, min_samples_leaf, class_weight, and max_samples.
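A wider grid along those lines might look like the sketch below. Everything concrete in it is an assumption made to keep the example small and runnable: the synthetic data from make_classification, the candidate values, and cv=3. Scoring the held-out set with the same F1 scorer makes the comparison with best_score_ like for like.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the real data (an assumption).
X, y = make_classification(
    n_samples=600, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative candidate values only; widen or narrow them for real data.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, 20],
    "min_samples_leaf": [1, 5],
    "class_weight": [None, "balanced"],
    "max_samples": [None, 0.7],  # fraction of rows drawn per tree
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="f1",  # F1 of the positive class (label 1)
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("Best CV F1:", grid.best_score_)
# score() applies the same "f1" scorer to the held-out data
test_f1 = grid.score(X_test, y_test)
print("Test F1:", test_f1)
```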
