Problem using GridSearchCV with RandomForestClassifier on large data: recall score always shows as 1, so the best...

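This is my first StackOverflow question and I need help! I have searched exhaustively for an answer, both on my own and through experimentation, but I am hoping someone in the community can help.

This is for my university dissertation, so any help would be hugely appreciated.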

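I will summarise as best I can:

I am using scikit-learn classifiers and trying to tune/cross-validate them with GridSearchCV, to form a baseline for later work with Keras/TensorFlow
My current problem is with RandomForestClassifier/GridSearchCV
I am using a large amount of data: the credit card fraud dataset from Kaggle
The data is imbalanced, so I oversample with SMOTE so that the training split has equal numbers of class 0 and class 1 (fraud), about 200,000 each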

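Now to explain the problem:

When I run GridSearchCV with RandomForestClassifier on this data, the recall score is always 1. This means no particular set of parameters is chosen as "best". I also do not understand why it is always 1. A run takes about 6-8 hours, so it is pointless if recall is 1 on every iteration.
However, when I simply do a single fit on the data (no GridSearchCV) and predict on the test set, I get a recall of around 80-84% (again, recall is the metric I care about). This is of course more realistic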

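My thoughts/experiments:

I tried undersampling the data to 492 samples per class, which gave a recall of about 90% on each GSCV iteration. That looks better, but it is still noticeably above average
I also tried different training set sizes (50000, 100000, ...), and they all gave recall = 1 on every iteration as well

My guess is that this happens because of too much data, overfitting, or something along those lines. Alternatively, I think GridSearchCV may be using the overall/non-fraud classification metric, which is close to 1 in these cases.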

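Here is a picture of the GSCV output when run on the {0: 200000, 1: 200000} training set (image: GSCV recall = 1 on every iteration). As you can see, the score for every fold is 1, but when testing/predicting with the model afterwards, the classification report shows a seemingly valid figure of 80%.

I know the test set contains relatively few fraud cases (only a few hundred). But that is because I oversampled only the training data, to keep the test data new (unseen).

So, looking at the classification report, I thought GridSearchCV might be picking up the wrong value (i.e. we are interested in the class = 1 metrics). However, according to the documentation, pos_label=1 is the default for scorers in scikit-learn, so that should not be the problem.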
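As a quick sanity check of the pos_label point, here is a minimal sketch on made-up toy labels (not the Kaggle data). It assumes only that scikit-learn is installed, and it compares the default recall with the class-1 recall, the class-0 recall and the overall accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, invented for illustration: 8 non-fraud (0) and 2 fraud (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A classifier that catches only one of the two fraud cases
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

recall_default = recall_score(y_true, y_pred)                  # pos_label defaults to 1
recall_fraud = recall_score(y_true, y_pred, pos_label=1)       # 1 of 2 fraud cases found
recall_non_fraud = recall_score(y_true, y_pred, pos_label=0)   # 8 of 8 non-fraud cases found
accuracy = accuracy_score(y_true, y_pred)                      # 9 of 10 correct overall

print(recall_default, recall_fraud, recall_non_fraud, accuracy)
```

On this toy data the default recall matches the class-1 value, while the non-fraud recall and the accuracy are much higher. So if the scorer really were reporting a non-fraud or overall metric, a score close to 1 would be expected.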

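I have tried custom scorers, the default scorer, etc.

Here is my code (a bit messy, but it should be clear what is going on! Note the commented-out single RF classifier, without GridSearch):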

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import itertools
data = pd.read_csv("creditcard.csv")
# Standardise the Amount column (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler
data['norm_Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1)).ravel()
# Drop the old Amount column and also the Time column as we don't want to include this at this stage
data = data.drop(['Time', 'Amount'], axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, roc_auc_score, roc_curve, recall_score, classification_report
########################################################
# MODEL SETUP
# Assign variables X and y corresponding to row data and its class value
X = data.drop(columns='Class')
y = data['Class']
# Whole dataset, training-test data splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
from collections import Counter
from imblearn.over_sampling import SMOTE
# Oversample the minority class in the training split only
sm = SMOTE(random_state=1)
X_res, y_res = sm.fit_resample(X_train, y_train)
print('Original dataset shape {}'.format(Counter(data['Class'])))
print('Training dataset shape {}'.format(Counter(y_train)))
print('Resampled training dataset shape {}'.format(Counter(y_res)))

print('Random Forest: ')
from sklearn.ensemble import RandomForestClassifier
# rf = RandomForestClassifier(n_estimators=250, criterion="gini", max_features=3, max_depth=10)
rf = RandomForestClassifier()
param_grid = {"n_estimators": [250, 500, 750],
              "criterion": ["gini", "entropy"],
              "max_features": [3, 5]}
from sklearn.metrics import recall_score, make_scorer
scorer = make_scorer(recall_score, pos_label=1)

# Grid-search the forest with 3-fold CV, scoring on class-1 recall
grid_search = GridSearchCV(rf, param_grid, n_jobs=1, cv=3, scoring=scorer, verbose=50)
grid_search.fit(X_res, y_res)
print(grid_search.best_params_, grid_search.best_estimator_)
# rf.fit(X_res, y_res)
# y_pred = rf.predict(X_test)
y_pred = grid_search.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
print('Test recall score: ', recall_score(y_test, y_pred))

Thanks,

Harry

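This is an overfitting problem. When you combine cross-validation with oversampling, the oversampling must be applied only to the training folds, never to the validation fold. For example, with 10-fold cross-validation, the 9 training folds are oversampled and used for fitting, while the remaining fold is used for validation without any oversampling. In your code SMOTE runs before GridSearchCV splits the data, so every validation fold contains synthetic points interpolated from samples that are also in the training folds, which is why the recall comes out as 1 in every fold.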
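To make the difference concrete, here is a minimal sketch that compares the two orderings. It is an illustration under stated assumptions, not your exact setup: the data is synthetic (make_classification), SMOTE is swapped for plain random duplication of minority rows so that only scikit-learn and NumPy are needed, and the helper names oversample_minority and cv_recall are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Duplicate class-1 rows at random until both classes are the same size.

    Stand-in for SMOTE (an assumption of this sketch, not what SMOTE does).
    """
    X_min, X_maj = X[y == 1], X[y == 0]
    X_min_up = resample(X_min, replace=True, n_samples=len(X_maj),
                        random_state=random_state)
    X_bal = np.vstack([X_maj, X_min_up])
    y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                            np.ones(len(X_min_up), dtype=int)])
    return X_bal, y_bal


def cv_recall(X, y, oversample_inside_folds, n_splits=3):
    """Mean class-1 recall over stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if oversample_inside_folds:
            # Correct order: balance the training folds only
            X_tr, y_tr = oversample_minority(X_tr, y_tr)
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        model.fit(X_tr, y_tr)
        scores.append(recall_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


# Synthetic imbalanced data, a rough stand-in for the fraud set
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                           flip_y=0.05, class_sep=0.5, random_state=0)

# Leaky order: oversample first, then cross-validate the oversampled data
X_leaky, y_leaky = oversample_minority(X, y)
leaky_recall = cv_recall(X_leaky, y_leaky, oversample_inside_folds=False)

# Correct order: cross-validate the original data, oversample per fold
honest_recall = cv_recall(X, y, oversample_inside_folds=True)

print("leaky CV recall: ", leaky_recall)
print("honest CV recall:", honest_recall)
```

The leaky score should come out far higher than the honest one, because copies of the same minority rows sit on both sides of every split. With imbalanced-learn, the usual way to get the correct order inside GridSearchCV is to wrap SMOTE and the classifier in an imblearn.pipeline.Pipeline and fit the grid search on the un-resampled training data. As far as I know, samplers in that pipeline are applied only during fit, so the validation folds stay untouched.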
