How to read a file in Python and avoid the "coercing to Unicode" error



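I am reading two files (trainfile, testfile) and then want to vectorize them with word_vectorizer. The problem is that I am probably not reading the files the right way. This is what I tried: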

# -*- coding: utf-8 -*-
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import os, sys

with open('/Users/user/Desktop/train.txt', 'r') as trainfile:
    contenido_del_trainfile= trainfile.read()
    print contenido_del_trainfile
with open('/Users/user/Desktop/test.txt', 'r') as testfile:
    contenido_del_testfile= testfile.read()
    print contenido_del_testfile

print "\nThis is the training corpus:\n", contenido_del_trainfile
print "\nThis is the test corpus:\n", contenido_del_testfile

train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
print word_vectorizer.get_feature_names()

This is the output:

TypeError: coercing to Unicode: need string or buffer, file found

How can I read the files correctly so that I can print something like this:

[u'word',... ,u'word']

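codecs.open expects the path to the file, not the file object itself. In your code, trainfile is the file object bound by the with statement, which is why the error says "file found".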

So, instead of

trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))

do

trainset = word_vectorizer.fit_transform(codecs.open('/Users/user/Desktop/train.txt','r','utf8'))
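
As a rough illustration of the same idea using only the standard library, here is a minimal sketch that opens a file by path with an explicit encoding and collects its lines as unicode strings. The helper name read_documents and the temporary sample file are hypothetical and stand in for train.txt. Treating each line as one document is an assumption about how the corpus is laid out.

```python
import io
import os
import tempfile


def read_documents(path, encoding="utf-8"):
    """Return the non-empty lines of a text file as unicode strings."""
    # io.open takes a path (not an already opened file object)
    # and decodes the bytes while reading.
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical sample file standing in for train.txt.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"el ni\u00f1o come manzanas\nla ni\u00f1a lee un libro\n")

documents = read_documents(sample_path)
os.remove(sample_path)

# Each element is one document; a list like this could then be
# passed to word_vectorizer.fit_transform(documents).
print(documents)
```

Because the list holds already decoded text, it can be printed or handed to the vectorizer without opening the file again.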
