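I'm working with the mushroom classification dataset (found here: https://www.kaggle.com/uciml/mushroom-classification).
I'm trying to split my data into training and test sets for my models, but if I use the train_test_split method my models always reach 100% accuracy. That is not the case when I split the data manually.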
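import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# data: the prepared mushroom DataFrame; the 'class' column holds the labels
x = data.copy()
y = x['class']
del x['class']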
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
This produces:
[[1299 0]
[ 0 1382]]
1.0
If I split the data manually, I get a more plausible result.
x = data.copy()
y = x['class']
del x['class']
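# Rows are taken in their original order, with no shuffling.
# Note: row 5443 ends up in neither set (train stops before it, test starts after it).
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]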
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Result:
[[2007 0]
[ 336 337]]
0.8746268656716418
What could cause this behaviour?
Edit: As requested, I'm including the shapes of the slices.
train_test_split:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2681, 64)
(2681,)
Manual split:
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2680, 64)
(2680,)
I also tried defining my own split function, and the resulting split gives 100% classifier accuracy as well.
Here is the code for the split:
def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame)*testRatio)
    dataCopy = dataCopy.sample(frac = 1)
    y = dataCopy['class']
    del dataCopy['class']
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]
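You got lucky with your train_test_split. The split you do manually probably leaves the test set with data the model has not seen anything like during training, so it is a stricter validation than train_test_split, which shuffles the data internally before splitting it.
For better validation, use K-fold cross-validation: each part of the data takes a turn as the test set while the rest is used for training, so the accuracy is checked on every part of the data.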
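As a rough illustration of the K-fold idea, here is a minimal sketch using scikit-learn's cross_val_score. The data is a synthetic stand-in from make_classification and the model is a DecisionTreeClassifier; both are assumptions made so the snippet runs on its own, not the asker's actual data or model. Since XGBClassifier follows the scikit-learn estimator interface, it should be possible to pass it in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded mushroom features (assumption: not the real data).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Shuffle before building the folds so that any ordering of the rows
# does not decide which samples land in which fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Stand-in classifier (assumption); swap in your own estimator here.
model = DecisionTreeClassifier(random_state=0)

# One accuracy value per fold: each fold is the test set exactly once.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores)
print("mean accuracy: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

If the per-fold scores vary a lot, that is a hint that a single train/test split could look much better or worse purely depending on which rows it happened to pick.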
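Your manual train/test split does not shuffle the data, whereas the scikit-learn function shuffles by default. The shapes of the splits are the same, but the data in them is different.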
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Code:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print("\nTraining with shuffle:")
print(X_train)
print(y_train)
print("\nTesting with shuffle:")
print(X_test)
print(y_test)
print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])
Output:
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]
[12 13]
[14 15]
[16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]
Training with shuffle:
[[ 0 1]
[16 17]
[ 4 5]
[ 8 9]
[ 6 7]
[12 13]]
[0, 8, 2, 4, 3, 6]
Testing with shuffle:
[[14 15]
[ 2 3]
[10 11]]
[7, 1, 5]
Without Shuffle:
[[ 0 1]
[ 2 3]
[ 4 5]
[ 6 7]
[ 8 9]
[10 11]]
[0, 1, 2, 3, 4, 5]
[[12 13]
[14 15]
[16 17]]
[6, 7, 8]
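The confusion matrices in the question hint at why this matters: the shuffled test set has 1299 and 1382 samples of the two classes, while the manually sliced one has 2007 and 673. Below is a small sketch of that effect on made-up labels that are sorted by class. The toy data and the use of stratify are assumptions for illustration; whether the mushroom rows are actually ordered in a similar way is not confirmed here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels sorted by class (assumption: made-up data, not the mushroom set):
# 60 samples of class 0 followed by 40 samples of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# Manual slice without shuffling: the last 33 rows are all class 1.
y_test_manual = y[67:]

# Shuffled split (the default); stratify=y additionally keeps the class ratio.
_, _, _, y_test_shuffled = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

print("overall share of class 1:      ", y.mean())
print("manual test share of class 1:  ", y_test_manual.mean())
print("shuffled test share of class 1:", y_test_shuffled.mean())
```

Comparing y_train.value_counts() and y_test.value_counts() for both of your splits would show whether the same thing is happening with your data.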
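It turns out the results were correct; I was just going about testing the model's results the wrong way.
I opened another thread, where someone suggested trying cross-validation, and that seems to have done the trick.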