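I am building a program that assigns multiple tags/labels to text descriptions. I am using OneVsRestClassifier to label my text descriptions. xTrain, xTest and yTrain are all of type 'numpy.ndarray'. That seems odd, given that I split the training and test data the correct way. Here is my code: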
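from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
# x is the sparse feature matrix and y the 0/1 label matrix printed further down
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)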
nb_clf = MultinomialNB()
sgd = SGDClassifier()
lr = LogisticRegression()
mn = MultinomialNB()
print("xTrain.shape = " + str(xTrain.shape))
print("xTest.shape = " + str(xTest.shape))
print("yTrain.shape = " + str(yTrain.shape))
print("yTest.shape = " + str(yTest.shape))
print("type(xTrain) = " + str(type(xTrain)))
print("type(xTest) = " + str(type(xTest)))
xTrain = csr_matrix(xTrain).toarray()
xTest = csr_matrix(xTest).toarray()
yTrain = csr_matrix(yTrain).toarray()
print("type(xTrain) = " + str(type(xTrain)))
for classifier in [nb_clf, sgd, lr, mn]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(xTrain.astype("U"), yTrain.astype("U"))
    y_pred = clf.predict(xTest)
    print("\ny_pred:")
    print(y_pred)
Output of x:
(1466, 1292) 0.13531037414782607
(1466, 1238) 0.21029405543816293
(1466, 988) 0.04688335706505732
...
...
Output of y:
[[0 0 0 ... 1 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 1 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
Output of the print statements:
xTrain.shape = (1173, 13817)
xTest.shape = (294, 13817)
yTrain.shape = (1173, 28)
yTest.shape = (294, 28)
type(xTrain) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTest) = <class 'scipy.sparse.csr.csr_matrix'>
type(xTrain) = <class 'numpy.ndarray'>
type(xTest) = <class 'numpy.ndarray'>
type(yTrain) = <class 'numpy.ndarray'>
Error (on the clf.fit line):
ValueError: Multioutput target data is not supported with label binarization
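Please first clarify the feature dimensions and the sample size in your program. For the target (y), the labels should not be one-hot encoded. For example, it should be [3] rather than [0 0 0 1].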
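One way to check how scikit-learn interprets a given y is sketched below. It is only an illustration: X, Y and one_hot are small synthetic arrays (assumptions, not the data from the question), and LogisticRegression is picked arbitrarily from the classifiers in the question. It uses type_of_target to compare an integer 0/1 matrix with the same matrix cast via astype("U"), fits OneVsRestClassifier on the integer version, and shows how the [3]-style encoding mentioned above can be derived with argmax when every sample has exactly one class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.utils.multiclass import type_of_target

# Synthetic stand-ins for the real data (assumption): 60 samples,
# 5 features, 3 possible labels per sample.
rng = np.random.default_rng(0)
X = rng.random((60, 5))
Y = (rng.random((60, 3)) > 0.5).astype(int)

# How scikit-learn classifies the integer 0/1 matrix.
kind_int = type_of_target(Y)
# How it classifies the same matrix after casting to strings.
kind_str = type_of_target(Y.astype("U"))
print(kind_int, kind_str)

# Fit on the numeric arrays, without any astype("U") cast.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
Y_pred = clf.predict(X)
print(Y_pred.shape)

# Single-label case only: turn one-hot rows into class indices,
# e.g. [0 0 0 1] becomes 3.
one_hot = np.array([[0, 0, 0, 1], [0, 1, 0, 0]])
class_idx = one_hot.argmax(axis=1)
print(class_idx)
```

If the two printed target types differ for your own yTrain, the string cast in the fit call is a likely place to look. The argmax conversion only makes sense when each row contains a single 1; for rows that can carry several labels, as far as I know OneVsRestClassifier accepts the integer 0/1 indicator matrix directly.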