I am reading two files (trainfile, testfile) and then I want to vectorize them with word_vectorizer. The problem is that I may not be reading the files the right way. This is what I tried:
# -*- coding: utf-8 -*-
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import os, sys
with open('/Users/user/Desktop/train.txt', 'r') as trainfile:
    contenido_del_trainfile = trainfile.read()
    print contenido_del_trainfile
with open('/Users/user/Desktop/test.txt', 'r') as testfile:
    contenido_del_testfile = testfile.read()
    print contenido_del_testfile
print "\nThis is the training corpus:\n", contenido_del_trainfile
print "\nThis is the test corpus:\n", contenido_del_testfile
train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
print word_vectorizer.get_feature_names()
This is the output:
TypeError: coercing to Unicode: need string or buffer, file found
How can I read the files the right way so that it prints something like this:
[u'word',... ,u'word']
codecs.open expects you to give it the path to the file, not the file object itself.
So, instead of
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
do
trainset = word_vectorizer.fit_transform(codecs.open('/Users/user/Desktop/train.txt','r','utf8'))