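Classifier reaches 100% accuracy after using train_test_split

I am working on the mushroom classification dataset (found here: https://www.kaggle.com/uciml/mushroom-classification).

I am trying to split the data into training and test sets for my model, but whenever I use the train_test_split method, the model reaches 100% accuracy. This does not happen when I split the data manually.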

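import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# data is the DataFrame loaded from the dataset
x = data.copy()
y = x['class']
del x['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))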

This produces:

[[1299    0]
 [   0 1382]]
1.0

If I split the data manually, I get a more plausible result.

x = data.copy()
y = x['class']
del x['class']
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Result:

[[2007    0]
 [ 336  337]]
0.8746268656716418

What causes this behavior?

Edit: as requested, I am including the shapes of the slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)

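I also tried defining my own split function, and the resulting split leads to 100% classifier accuracy as well.

Here is the code for the split: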

def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame)*testRatio)
    dataCopy = dataCopy.sample(frac = 1)
    y = dataCopy['class']
    del dataCopy['class']
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]

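You got lucky with your train_test_split. The split you made by hand probably leaves the test set with samples unlike anything the model saw during training, which makes it a stricter validation than train_test_split, which shuffles the data internally before splitting it.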

For better validation, use K-fold cross-validation, which lets each part of the data serve as the test set in turn while the rest is used for training.
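A minimal sketch of K-fold cross-validation with scikit-learn, to make the idea concrete. The data here is synthetic (make_classification) and DecisionTreeClassifier is only a stand-in for xgb.XGBClassifier(); both are assumptions for illustration, not the asker's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the mushroom features and labels (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Shuffle once, then rotate through 5 folds so that every row
# lands in the test fold exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stand-in model; any scikit-learn compatible classifier can be used here.
model = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```

A large spread between the per-fold scores is a hint that a single train/test split could be misleading in either direction.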

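Your manual train/test split does not shuffle, whereas the scikit-learn function shuffles by default. The splits have the same shape, but they contain different data.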
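To see why the two splits can behave so differently, here is a small sketch on a hypothetical dataset whose rows are ordered by class. Whether the mushroom file is ordered in a comparable way is an assumption; the toy arrays below only illustrate the mechanism.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical ordered dataset: the first 70 rows are class 0, the last 30 are class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# shuffle=False behaves like a positional slice: the test set is simply the tail.
_, _, y_train_tail, y_test_tail = train_test_split(
    X, y, test_size=30, shuffle=False)

# Default behaviour: rows are shuffled before splitting.
_, _, y_train_shuf, y_test_shuf = train_test_split(
    X, y, test_size=30, random_state=0)

# Class counts [class 0, class 1] in each test set.
print("tail test class counts:    ", np.bincount(y_test_tail, minlength=2))
print("shuffled test class counts:", np.bincount(y_test_shuf, minlength=2))
```

In the unshuffled case the training rows are all class 0 and the test rows are all class 1, so the model is evaluated on something it never saw. Comparing the class counts of each split in this way (for a pandas Series, value_counts() does the same job) is a quick check on real data.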

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Code:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 9 rows of 2 features, with labels 0..8
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))

# Default behaviour: rows are shuffled before splitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print("\nTraining with shuffle:")
print(X_train)
print(y_train)

print("\nTesting with shuffle:")
print(X_test)
print(y_test)

# shuffle=False keeps the original row order, like a manual slice
print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])

Output:

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Training with shuffle:
[[ 0  1]
 [16 17]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
[0, 8, 2, 4, 3, 6]

Testing with shuffle:
[[14 15]
 [ 2  3]
 [10 11]]
[7, 1, 5]

Without Shuffle:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
[0, 1, 2, 3, 4, 5]

[[12 13]
 [14 15]
 [16 17]]
[6, 7, 8]

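It turns out the results were correct; I was simply going about testing the model's results the wrong way.

I opened another thread, where someone suggested trying cross-validation, and that seems to solve the problem.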
