How to read a file in Python and avoid the "coercing to Unicode" error

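I am reading two files (trainfile, testfile) and then want to vectorize them with word_vectorizer. The problem is that I am probably not reading the files correctly. This is what I tried: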

# -*- coding: utf-8 -*-
import codecs
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import os, sys

with open('/Users/user/Desktop/train.txt', 'r') as trainfile:
    contenido_del_trainfile= trainfile.read()
    print contenido_del_trainfile
with open('/Users/user/Desktop/test.txt', 'r') as testfile:
    contenido_del_testfile= testfile.read()
    print contenido_del_testfile

print "\nThis is the training corpus:\n", contenido_del_trainfile
print "\nThis is the test corpus:\n", contenido_del_testfile

train = []
word_vectorizer = CountVectorizer(analyzer='word')
trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))
print word_vectorizer.get_feature_names()

This is the output:

TypeError: coercing to Unicode: need string or buffer, file found

How can I read the files correctly so that it prints something like this:

[u'word',... ,u'word']

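codecs.open expects the path to the file, not the file object itself. In your code, trainfile is the file object bound by the with statement, not a path string, which is why the call raises the TypeError.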

So, instead of

trainset = word_vectorizer.fit_transform(codecs.open(trainfile,'r','utf8'))

use

trainset = word_vectorizer.fit_transform(codecs.open('/Users/user/Desktop/train.txt','r','utf8'))
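If it helps, here is a minimal sketch of the same idea using io.open, which behaves the same way on Python 2 and Python 3. The helper name read_lines and the temporary demo file are assumptions made only for illustration; they are not part of the original code, and the demo file merely stands in for train.txt.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_lines(path, encoding='utf-8'):
    """Return the file's non-empty lines as decoded (unicode) text.

    Hypothetical helper: it takes a path string, not an open file object.
    """
    with io.open(path, 'r', encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Hypothetical stand-in for train.txt so the sketch runs anywhere.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(u'el gato come pescado\n\nel perro come carne\n')

documents = read_lines(demo_path)
os.remove(demo_path)
print(documents)
```

The resulting list holds one decoded string per line, so it should be usable as the argument to word_vectorizer.fit_transform in place of the codecs.open call, with each line treated as one document.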
