我正在使用(第一次)scikit库,我得到了这个错误:
ValueError: empty vocabulary; perhaps the documents only contain stop words
File "C:UsersA605563DesktopvelibProjetPresoTraitementTwitterDico.py", line 33, in <module>
X_train_counts = count_vect.fit_transform(FileTweets)
File "C:Python27Libsite-packagessklearnfeature_extractiontext.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:Python27Libsite-packagessklearnfeature_extractiontext.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only contain stop words
但我不明白为什么会这样。
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk
TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets = TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
f2.write(line.replace('n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()
我的数据是原始推文:
11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11, . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"
有人知道这里发生了什么吗?
这是一个更简单的解决方案:
x = open('bad_words_train.txt', 'r+')
count_vect = CountVectorizer(input=file)
X_train = count_vect.fit_transform(x)
print(X_train)
我找到了一个解决方案:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk
from io import StringIO
TweetsFile = open('tweets2015-08-13.csv','r+')
yourResult = [line.split(',') for line in TweetsFile.readlines()]
count_vect = CountVectorizer(input="file")
docs_new = [ StringIO.StringIO(x) for x in yourResult ]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print X_train_counts.shape