I am working with a matrix X and an array y holding the label for each row of that matrix. X is defined as:
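import pandas as pd
df = pd.read_csv("./data/svm_matrix_0.csv", sep=',', header=None, encoding="ISO-8859-1")
# convert_objects() was removed from pandas; pd.to_numeric with errors='coerce' turns non-numeric cells into NaN
df2 = df.apply(pd.to_numeric, errors='coerce')
X = df2.values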
y is defined as:
df = pd.read_csv('./data/Step7_final.csv', index_col=False, encoding="ISO-8859-1")
y = df.iloc[:, 1].values
Then I apply an SVM to the data:
from sklearn import svm
from sklearn.model_selection import train_test_split
clf = svm.SVC(kernel='linear', C=1)  # specify the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # random split into training and test data
clf.fit(X_train, y_train)  # train the model
Now I want to vary the size of X_train and compute the training and test error for each size, using:
# Note: score() returns mean accuracy, so the error would be 1 - score
test_error = clf.score(X_test, y_test)
train_error = clf.score(X_train, y_train)
The size of X_train should increase (over, say, 15 different values), and the results should be stored in a dictionary of the form {X_train size: (test_error, train_error)}.
I tried:
array = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
dicto = {}
for i in array:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = i)
    clf.fit(X_train,y_train)
    test = clf.score(X_test, y_test)
    train = clf.score(X_train, y_train)
    dicto[i] = test, train
print(dicto)
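but this does not work properly, because I am also changing X_test on every iteration. Does anyone know how to adapt my code so that only the size of X_train varies (so that the errors are computed as the amount of data used grows)?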
What you can do is set the test data aside first:
X_train_prev, X_test_prev, y_train_prev, y_test_prev = train_test_split(X, y, test_size = 0.2)
Now run the loop that varies the training size, but score on the **previously held-out test data**, like this:
array = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
dicto = {}
for i in array:
    X_train, _, y_train, _ = train_test_split(X, y, test_size = i)
    clf.fit(X_train,y_train)
    #use the previous test data...
    test = clf.score(X_test_prev, y_test_prev)
    train = clf.score(X_train, y_train)
    dicto[i] = test, train
print(dicto)
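To see how a training subset drawn from the full X relates to the held-out rows, here is a minimal sketch that tracks row indices instead of real data. The index array `idx`, its length of 1000, and the `random_state` values are made up for illustration and are not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical row indices standing in for the rows of X.
idx = np.arange(1000)

# Held-out test rows, split off once (as with X_test_prev).
_, test_idx = train_test_split(idx, test_size=0.2, random_state=0)

# A training subset drawn again from the FULL index set, as the loop does.
train_idx, _ = train_test_split(idx, test_size=0.5, random_state=1)

# Rows that are in both the training subset and the held-out test set.
overlap = np.intersect1d(train_idx, test_idx)
print(len(overlap), "of", len(test_idx), "held-out rows also appear in the training subset")
```

Any non-zero overlap means the model is scored on rows it has already seen during training.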
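Note, however, that what I did above can distort the model's scores, since every split is random, and it also contaminates the test data: each iteration re-splits the full X and y, so the training subset can contain rows from the held-out test set. To avoid this, split only the training portion inside the loop, so that your test data stays separate!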
Like this (the line inside the for loop):
X_train, _, y_train, _ = train_test_split(X_train_prev, y_train_prev, test_size = i)
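Putting the pieces together, a self-contained sketch might look like the following. It rests on several assumptions that are not in the original post: the synthetic `X` and `y` stand in for the real CSV data, `train_size` is used instead of `test_size` so that larger fractions really mean more training rows, the dictionary is keyed by the actual number of training rows, and the stored values are errors computed as `1 - score` (since `score()` returns mean accuracy).

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own X and y).
rng = np.random.RandomState(0)
X = rng.randn(500, 5)
y = (X[:, 0] + 0.5 * rng.randn(500) > 0).astype(int)

# Hold out the test set once; the loop below never re-splits it.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = svm.SVC(kernel="linear", C=1)
fractions = np.linspace(0.1, 0.9, 15)  # 15 increasing fractions of the pool

results = {}
for frac in fractions:
    # Keep `frac` of the training pool; the remainder is simply discarded.
    X_train, _, y_train, _ = train_test_split(
        X_pool, y_pool, train_size=frac, random_state=0
    )
    clf.fit(X_train, y_train)
    # score() returns mean accuracy, so error = 1 - accuracy.
    test_error = 1 - clf.score(X_test, y_test)
    train_error = 1 - clf.score(X_train, y_train)
    results[len(X_train)] = (test_error, train_error)

print(results)
```

Keying by `len(X_train)` matches the requested {X_train size: (test_error, train_error)} layout; if you prefer fractions as keys, use `frac` instead.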