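I have a set of documents and a set of labels. Right now I am using train_test_split to split my dataset in a 90:10 ratio. However, I would like to use k-fold cross-validation instead.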
train = []
with open("/Users/rte/Documents/Documents.txt") as f:
    for line in f:
        train.append(line.strip().split())
labels = []
with open("/Users/rte/Documents/Labels.txt") as t:
    for line in t:
        labels.append(line.strip().split())
X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.1, random_state=42)
When I try the method shown in the scikit-learn documentation:
kf = KFold(len(train), n_folds=3)
for train_index, test_index in kf:
    X_train, X_test = train[train_index], train[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
I get this error:
X_train, X_test = train[train_index],train[test_index]
TypeError: only integer arrays with one element can be converted to an index
How can I perform 10-fold cross-validation on my documents and labels?
There are two ways to fix this error:
First way:
Convert the data to NumPy arrays:
import numpy as np
[...]
train = np.array(train)
labels = np.array(labels)
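After that, your current code should work, because NumPy arrays (unlike plain Python lists) can be indexed with an array of indices.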
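A minimal sketch of the first approach with 10 folds. It assumes a recent scikit-learn, where KFold is imported from sklearn.model_selection and takes n_splits (the KFold(len(train), n_folds=3) call in the question is the older API). The toy documents and labels are made up for illustration, and dtype=object is an assumption added here so that tokenized documents of different lengths are accepted by newer NumPy versions.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the tokenized lines of Documents.txt and Labels.txt.
train = [["doc", str(i), "text"] * (1 + i % 3) for i in range(20)]
labels = [["pos"] if i % 2 else ["neg"] for i in range(20)]

# dtype=object (an assumption, not in the original answer) keeps each
# variable-length token list as a single array element.
X = np.array(train, dtype=object)
y = np.array(labels, dtype=object)

# 10 folds; shuffling and the seed are illustrative choices.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

fold_sizes = []
for train_index, test_index in kf.split(X):
    # NumPy arrays accept an array of indices, so this indexing now works.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    fold_sizes.append((len(X_train), len(X_test)))
```

Each pass through the loop yields one train/test split, so a model would be fitted and scored inside the loop body.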
Second way:
Use list comprehensions to index the train and labels lists with the train_index and test_index lists:
for train_index, test_index in kf:
    X_train, X_test = [train[i] for i in train_index], [train[j] for j in test_index]
    y_train, y_test = [labels[i] for i in train_index], [labels[j] for j in test_index]
(For this solution, see also the related question on indexing a list with another list.)
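The second approach can be sketched the same way for 10 folds while keeping plain Python lists. This again assumes a recent scikit-learn (KFold from sklearn.model_selection, iterated through kf.split rather than directly); the sample data is hypothetical.

```python
from sklearn.model_selection import KFold

# Hypothetical stand-ins for the tokenized documents and their labels.
train = [["w%d" % i, "tok"] for i in range(30)]
labels = [["label%d" % (i % 3)] for i in range(30)]

# Without shuffle, the folds are consecutive blocks of samples.
kf = KFold(n_splits=10)

folds = []
for train_index, test_index in kf.split(train):
    # Index the lists one element at a time; no NumPy conversion needed.
    X_train = [train[i] for i in train_index]
    X_test = [train[j] for j in test_index]
    y_train = [labels[i] for i in train_index]
    y_test = [labels[j] for j in test_index]
    folds.append((X_train, X_test, y_train, y_test))
```

This keeps the documents as ordinary lists of tokens, which may be convenient if later steps expect lists rather than arrays.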