Classifier accuracy is 100% after using train_test_split



I am working with the mushroom classification dataset (found here: https://www.kaggle.com/uciml/mushroom-classification).

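I am trying to split my data into training and test sets for my model, but if I use the train_test_split method, my model always reaches 100% accuracy. This is not the case when I split the data manually.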

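import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# `data` is the prepared DataFrame; 'class' is the target column
x = data.copy()
y = x['class']
del x['class']
# Shuffled split (train_test_split shuffles by default)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))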

This produces:

[[1299    0]
[   0 1382]]
1.0

If I split the data manually, I get a more reasonable result.

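x = data.copy()
y = x['class']
del x['class']
# Unshuffled split by position: first 5443 rows train, the tail tests.
# Note: row 5443 falls in neither slice (0:5443 stops at 5442, 5444: starts at 5444).
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))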

Result:

[[2007    0]
[ 336  337]]
0.8746268656716418

What causes this behavior?

Edit: As requested, I am including the shapes of the slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)

I also tried defining my own split function, and the resulting split also leads to 100% classifier accuracy.

Here is the code for the split:

def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame)*testRatio)
    # Shuffle all rows before slicing
    dataCopy = dataCopy.sample(frac = 1)
    y = dataCopy['class']
    del dataCopy['class']
    # Returns x_train, x_test, y_train, y_test
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]

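You got lucky with your train_test_split. The split you did manually probably leaves the test set with the most unseen data, which makes it a better validation than train_test_split, which shuffles the data internally before splitting it.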

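For better validation, use K-fold cross-validation, which lets each part of the data serve as the test set in turn, with the rest used for training, when measuring model accuracy.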
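A minimal sketch of that suggestion, under stated assumptions: the data here is synthetic (generated with make_classification, not the mushroom dataset), and DecisionTreeClassifier is only a placeholder for whatever estimator is being evaluated, such as the XGBClassifier from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the mushroom dataset).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Shuffle before cutting the folds so row order cannot bias a single fold.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Placeholder estimator; any scikit-learn-compatible classifier can be used.
model = DecisionTreeClassifier(random_state=0)

# Each of the 5 folds is held out once while the other 4 are used for training.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", scores.mean())
```

A large spread between the per-fold scores would suggest that a single train/test split can give a misleading number.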

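Your manual train/test split does no shuffling, whereas the scikit-learn function shuffles by default. The splits have almost the same shapes, but they contain different rows.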

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Code:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))

# Default behavior: rows are shuffled before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print("\nTraining with shuffle:")
print(X_train)
print(y_train)

print("\nTesting with shuffle:")
print(X_test)
print(y_test)

# shuffle=False keeps the original row order, like a manual slice
print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])

Output:

[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[ 8  9]
[10 11]
[12 13]
[14 15]
[16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]
Training with shuffle:
[[ 0  1]
[16 17]
[ 4  5]
[ 8  9]
[ 6  7]
[12 13]]
[0, 8, 2, 4, 3, 6]
Testing with shuffle:
[[14 15]
[ 2  3]
[10 11]]
[7, 1, 5]
Without Shuffle:
[[ 0  1]
[ 2  3]
[ 4  5]
[ 6  7]
[ 8  9]
[10 11]]
[0, 1, 2, 3, 4, 5]
[[12 13]
[14 15]
[16 17]]
[6, 7, 8]
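The confusion matrices in the question hint at why the two splits behave differently: the manual test set holds 2007 rows of one class and 673 of the other, while the shuffled one holds 1299 and 1382. The sketch below shows how an unshuffled tail slice can end up with a different class balance than a shuffled split. The labels are hypothetical and deliberately ordered (assumption: this is not the mushroom data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical ordered labels: the first 70 rows are class 0, the last 30 class 1.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(len(y)).reshape(-1, 1)

# Manual split: the test set is simply the last 33 rows.
y_test_manual = y[67:]

# train_test_split shuffles the rows before splitting by default.
_, _, _, y_test_shuffled = train_test_split(X, y, test_size=0.33, random_state=0)

manual_share = y_test_manual.mean()      # fraction of class 1 in the tail slice
shuffled_share = y_test_shuffled.mean()  # fraction of class 1 after shuffling
print("class-1 share, manual tail slice:", manual_share)
print("class-1 share, shuffled split:", shuffled_share)
```

Running the same comparison on the real labels (for example with value_counts on y_train and y_test) would show whether the row order of the file is related to the class or to the features.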

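It turns out the results were correct; I was just going about testing the model's results the wrong way.

I opened another thread, and someone there suggested trying cross-validation, which seems to solve the problem.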
