I am trying to classify my text data using tf-idf and a naive Bayes classifier:
cls = MultinomialNB()
vec = TfidfVectorizer(input='file', analyzer=word_tokenize, stop_words=stop_w, use_idf=False)
for i, filename in enumerate(files):
    with codecs.open(filename, encoding='utf8') as f:
        bow = vec.fit_transform(f)
        # I have one target for this bow (each file has a unique subject).
        y = np.array([repeat(i, times=41253)])
        cls.fit(bow, y)
The output of bow.shape is:
(41253, 15987)
But I get this exception:
Traceback (most recent call last):
  File "/home/x/PycharmProjects/PWC/naiive.py", line 35, in <module>
    cls.fit(bow, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 522, in fit
    X, y = check_X_y(X, y, 'csr')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 516, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 41253]
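I know something is wrong with my dimensions/shapes, but I don't know how to fix it. Is my implementation even correct in the first place?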
This line should be:
y = repeat(i, times=41253)
Remove the extra square brackets and the np.array() call.
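To see why the brackets matter, here is a minimal sketch using only the standard library. It assumes that `repeat` in the question is `itertools.repeat` (the `times=` keyword suggests so, but the question does not show the import), and `label` is a hypothetical class index. Materializing the iterator with `list(...)` is an extra step added here so the lengths can be compared; it is not part of the fix above.

```python
from itertools import repeat

n_samples = 41253  # number of rows in bow, taken from bow.shape in the question
label = 0          # hypothetical class index for one file

# Wrapping the repeat object in a list yields a list with ONE element
# (the iterator itself), so the target appears to hold a single sample.
wrapped = [repeat(label, times=n_samples)]

# Materializing the iterator yields one label per row of bow.
labels = list(repeat(label, times=n_samples))

print(len(wrapped))  # 1
print(len(labels))   # 41253
```

The two lengths correspond to the `[ 1 41253]` reported in the ValueError: the target had 1 sample while bow had 41253 rows. If numpy is already imported, `np.full(n_samples, label)` is another way to build a label vector of the right length.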