Text document classification: ValueError: X and y have incompatible shapes



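I am trying to classify documents by category. I want to train on data from a few categories, then give the model some text and have it tell me which category the text belongs to. For training I am using the 20 newsgroups dataset. I get the error "ValueError: X and y have incompatible shapes. X has 5 samples, but y has 4." at classifier.fit(X_train, Y).

Can anyone tell me why X has 5 samples when X comes from data_train, which was loaded with 4 categories? I would also appreciate any suggestions for a better way to do this.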

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

remove = ()
categories = [ 'alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

categories_test = ['sci.space' ]

print("Loading newsgroups dataset for categories:")
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,     random_state=42, remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories_test, shuffle=True, random_state=42, remove=remove)

X_test = data_test
X_train = data_train
y_train = data_train.target_names
lb = preprocessing.LabelBinarizer()
Y = lb.fit_transform(y_train)
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
for item, labels in zip(X_test.target_names, all_labels):
    print '%s => %s' % (item, ', '.join(labels))

The problem is here:

X_train = data_train
y_train = data_train.target_names

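Your data_train is an object, not an array of samples. You are storing that whole object in X_train, when you only want the input documents (which are in the data_train.data field). In addition, "target_names" holds the names of the labels, not the actual per-document labels (which, if I remember correctly, are stored in .target).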
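A likely explanation for the exact numbers in the error: the object returned by fetch_20newsgroups is a dict-like Bunch, and iterating over it yields its keys rather than its documents. The sketch below reproduces this with a hypothetical stand-in Bunch. The field names mirror the real ones, but the contents are made up and the number of keys in a real Bunch can vary between scikit-learn versions, so treat the 5 as an illustration rather than a guarantee.

```python
from sklearn.utils import Bunch
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the object returned by fetch_20newsgroups.
# Field names mirror the real ones; the contents are invented for illustration.
data_train = Bunch(
    data=["first post", "second post", "third post"],
    filenames=["a.txt", "b.txt", "c.txt"],
    target_names=["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"],
    target=[0, 1, 2],
    DESCR="toy dataset",
)

# Iterating over a Bunch yields its keys, so the vectorizer treats each
# key name ("data", "filenames", ...) as one document.
X_wrong = CountVectorizer().fit_transform(data_train)
n_wrong_samples = X_wrong.shape[0]

# target_names has one entry per category, not one per document.
n_label_names = len(data_train.target_names)

print(n_wrong_samples, n_label_names)
```

Neither number has anything to do with how many documents were loaded, which is why the two shapes disagree.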

It should be:

X_train = data_train.data
y_train = data_train.target

The same applies to "data_test".
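Putting the fix together, here is a minimal sketch of the corrected pipeline. To keep it runnable without downloading anything, it uses a tiny made-up corpus (train_texts, train_labels, test_texts and the two-entry target_names are all assumptions for illustration). With the real dataset you would pass data_train.data and data_train.target instead. Since the targets are already integer class indices, the LabelBinarizer step from the question is not needed here.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-ins for data_train.data, data_train.target and
# data_train.target_names (invented for illustration).
target_names = ["comp.graphics", "sci.space"]
train_texts = [
    "rendering polygons with a graphics card",
    "image formats and pixel shading",
    "the shuttle reached orbit around the moon",
    "rocket launch to the space station",
]
train_labels = [0, 0, 1, 1]  # one integer label per document
test_texts = ["a new rocket launch into orbit"]

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])

# X is a list of raw documents, y has exactly one label per document.
classifier.fit(train_texts, train_labels)
predicted = classifier.predict(test_texts)

# Map each predicted class index back to its category name.
for text, label in zip(test_texts, predicted):
    print("%s => %s" % (text, target_names[label]))
```

The key point is that the first argument to fit is a sequence of documents and the second has the same length, one label per document.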
