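I have a term-document matrix and a corresponding label matrix. I need to split the dataset into 10 parts, train a libsvm classifier on any 7 of them, and test on the remaining 3. I have to do this for every possible choice, i.e. 10C7 combinations. Below is the code I use to train and test the SVM; I can't figure out how to split the data and iterate over all the cases.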
m = svm_train(labels[0:2000], rows_1[0:2000], '-c '+str(k)+' -g '+str(g))
p_label, p_acc, p_val = svm_predict(labels[2000:], rows_1[2000:], m)
acc.append(p_acc)
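Here 'labels' is the array of labels and 'rows_1' holds the rows of the term-document matrix. I'm new to this, please help!
You have to shuffle the data and create indices for the training and test folds. For example, if you have 2000 training samples and want to use 10 folds, you would have: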
fold1
test[0:200]
train[200:2000]
fold2
test[200:400]
train[0:200, 400:2000]
etc
Here is a Python example:
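import numpy as np
from libsvm.svmutil import svm_train, svm_predict  # older installs: from svmutil import *

n_samples = 2000  # number of samples, as in the example above
indices = np.random.permutation(n_samples)  # sample indices 0..1999 in random order
n_folds = 10
folds = np.array_split(indices, n_folds)  # 10 disjoint blocks of shuffled indices
acc = []
for i, test_idx in enumerate(folds):
    # train on every fold except the one held out for testing
    train_idx = np.concatenate(folds[:i] + folds[i+1:])
    # select by index; filtering by value would drop every sample that shares a label with the test set
    test_labels = [labels[j] for j in test_idx]
    test_rows = [rows_1[j] for j in test_idx]
    train_labels = [labels[j] for j in train_idx]
    train_rows = [rows_1[j] for j in train_idx]
    m = svm_train(train_labels, train_rows, '-c '+str(k)+' -g '+str(g))
    p_label, p_acc, p_val = svm_predict(test_labels, test_rows, m)
    acc.append(p_acc[0])  # p_acc is (accuracy, MSE, squared correlation coefficient)
print("Accuracy: {}%".format(np.mean(acc)))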
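The loop above holds out one fold at a time, which is ordinary 10-fold cross-validation. For the 7-train / 3-test scheme the question asks about, the same fold idea can be combined with itertools.combinations to enumerate every choice of 7 training folds. The following is only a sketch: the helper name combination_splits and its parameters are made up for illustration, it uses the standard library only, and it produces index lists rather than calling libsvm itself.

```python
from itertools import combinations
import random

def combination_splits(n_samples, n_folds=10, n_train_folds=7, seed=0):
    """Yield (train_idx, test_idx) for every choice of training folds.

    Hypothetical helper, not part of libsvm.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reuse for all splits
    # fold f takes every n_folds-th shuffled index, so fold sizes differ by at most 1
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for train_folds in combinations(range(n_folds), n_train_folds):
        train_idx = [i for f in train_folds for i in folds[f]]
        test_idx = [i for f in range(n_folds)
                    if f not in train_folds for i in folds[f]]
        yield train_idx, test_idx

splits = list(combination_splits(2000))
print(len(splits))  # 120, i.e. C(10, 7)

# Usage with the question's data (labels, rows_1, k, g assumed to be defined):
# for train_idx, test_idx in combination_splits(len(labels)):
#     m = svm_train([labels[i] for i in train_idx],
#                   [rows_1[i] for i in train_idx], '-c '+str(k)+' -g '+str(g))
#     p_label, p_acc, p_val = svm_predict([labels[i] for i in test_idx],
#                                         [rows_1[i] for i in test_idx], m)
#     acc.append(p_acc[0])
```

That is 120 train/test rounds per (k, g) setting, so expect it to take roughly 12 times as long as the 10-fold loop. The 120 test sets overlap heavily, so the averaged accuracy is not built from independent estimates.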