CountVectorizer(): 'StreamBackedCorpusView' object has no attribute 'lower'



I am trying to instantiate CountVectorizer() and run it on the NLTK movie reviews corpus, using the following code:

>>> import nltk
>>> import nltk.corpus
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from nltk.corpus import movie_reviews
>>> neg_rev = movie_reviews.fileids('neg')
>>> pos_rev = movie_reviews.fileids('pos')
>>> rev_list = []  # Empty list
>>> for rev in neg_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev))
...
>>> for rev_pos in pos_rev:
...     rev_list.append(nltk.corpus.movie_reviews.words(rev_pos))
...
>>> count_vect = CountVectorizer()
>>> X_count_vect = count_vect.fit_transform(rev_list)

I get the following error:

AttributeError                            Traceback (most recent call last)
<ipython-input-37-00e9047daa67> in <module>()
----> 1 X_count_vect = count_vect.fit_transform(rev_list)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
837 
838         vocabulary, X = self._count_vocab(raw_documents,
--> 839                                           self.fixed_vocabulary_)
840 
841         if self.binary:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
760         for doc in raw_documents:
761             feature_counter = {}
--> 762             for feature in analyze(doc):
763                 try:
764                     feature_idx = vocabulary[feature]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
239 
240             return lambda doc: self._word_ngrams(
--> 241                 tokenize(preprocess(self.decode(doc))), stop_words)
242 
243         else:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
205 
206         if self.lowercase:
--> 207             return lambda x: strip_accents(x.lower())
208         else:
209             return strip_accents
AttributeError: 'StreamBackedCorpusView' object has no attribute 'lower'

nltk.corpus.movie_reviews.words(rev_pos) already returns the text tokenized, for example:

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

Can anyone tell me what I am doing wrong? I suspect I am missing a step when building the list of movie reviews (rev_list).

TIA

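It looks like your .words() call does not actually give you a list of tokens, but a StreamBackedCorpusView object. This class lets you retrieve the tokens, but it is not itself a complete representation of them.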
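The traceback shows the default preprocessor calling `x.lower()` on each document, which only works when every document is a string. Here is a minimal sketch of that failure in plain Python. Note that `lowercase_preprocess` is a hypothetical stand-in for that step, not scikit-learn's actual code, and a plain list stands in for the corpus view:

```python
# A tokenized review, as shown in the question.
tokens = ['films', 'adapted', 'from', 'comic', 'books']


def lowercase_preprocess(doc):
    """Hypothetical stand-in for the `x.lower()` step seen in the traceback."""
    return doc.lower()


# A sequence of tokens has no .lower() method, so this raises AttributeError.
try:
    lowercase_preprocess(tokens)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# A plain string works, which is the input the default settings expect.
lowered = lowercase_preprocess("Films adapted from comic books")
```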

However, you can retrieve the tokens from the view. For more details on working with StreamBackedCorpusView, see the link below.

http://nltk.sourceforge.net/corpusview/corpusview.StreamBackedCorpusView-class.html
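One way to apply this is to join the tokens of each view back into a single string before handing the list to CountVectorizer. The sketch below assumes nltk, scikit-learn and the movie_reviews corpus are installed. The helper names `views_to_documents` and `vectorize_movie_reviews` are made up for illustration and are not part of either library:

```python
from typing import Iterable, List


def views_to_documents(token_views: Iterable[Iterable[str]]) -> List[str]:
    """Join each tokenized review back into one whitespace-separated string."""
    return [" ".join(tokens) for tokens in token_views]


def vectorize_movie_reviews():
    """Build a document-term matrix from the NLTK movie reviews corpus.

    Assumption: nltk, scikit-learn and the movie_reviews corpus are available.
    They are imported here so the helper above works without them.
    """
    from nltk.corpus import movie_reviews
    from sklearn.feature_extraction.text import CountVectorizer

    fileids = movie_reviews.fileids('neg') + movie_reviews.fileids('pos')

    # Iterating over each corpus view yields its tokens one by one.
    rev_list = views_to_documents(movie_reviews.words(f) for f in fileids)
    # Alternative: skip tokenization and read the raw text directly:
    # rev_list = [movie_reviews.raw(f) for f in fileids]

    count_vect = CountVectorizer()
    X_count_vect = count_vect.fit_transform(rev_list)
    return count_vect, X_count_vect
```

Calling `vectorize_movie_reviews()` should return the fitted vectorizer and the sparse count matrix. If you would rather keep NLTK's tokenization as is, passing a callable such as `CountVectorizer(analyzer=lambda doc: doc)` should also work, since a callable analyzer receives each document unchanged and skips the lowercasing step.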
