I'm trying to do some text classification with spaCy, but I get an error saying my vocabulary is empty.
I tried a classic dataset and got the same error. I've seen suggestions about splitting the text into parts, but I have many rows rather than one very long one.
Here is the code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
classifier = LogisticRegression()
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])
pipe.fit(X_train, y_train)
--------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-91-b5a14e655d5a> in <module>
10
11 # Model generation
---> 12 pipe.fit(X_train, y_train)
~\anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
339 """
340 fit_params_steps = self._check_fit_params(**fit_params)
--> 341 Xt = self._fit(X, y, **fit_params_steps)
342 with _print_elapsed_time('Pipeline',
343 self._log_message(len(self.steps) - 1)):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
301 cloned_transformer = clone(transformer)
302 # Fit or load from cache the current transformer
--> 303 X, fitted_transformer = fit_transform_one_cached(
304 cloned_transformer, X, y, None,
305 message_clsname='Pipeline',
~\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
350
351 def __call__(self, *args, **kwargs):
--> 352 return self.func(*args, **kwargs)
353
354 def call_and_shelve(self, *args, **kwargs):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752 with _print_elapsed_time(message_clsname, message):
753 if hasattr(transformer, 'fit_transform'):
--> 754 res = transformer.fit_transform(X, y, **fit_params)
755 else:
756 res = transformer.fit(X, y, **fit_params).transform(X)
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200 max_features = self.max_features
1201
-> 1202 vocabulary, X = self._count_vocab(raw_documents,
1203 self.fixed_vocabulary_)
1204
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1131 vocabulary = dict(vocabulary)
1132 if not vocabulary:
-> 1133 raise ValueError("empty vocabulary; perhaps the documents only"
1134 " contain stop words")
1135
ValueError: empty vocabulary; perhaps the documents only contain stop words
It looks like you're just using the spaCy tokenizer? I'm not sure what's going on, but you should check the tokenizer's output on your documents.
Note that while I think you can use the tokenizer that way, it's more typical to use a blank pipeline, like:
import spacy
nlp = spacy.blank("en")
words = [tok.text for tok in nlp("this is my input text")]
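To spot-check what the vectorizer is actually receiving, it helps to know the contract: `CountVectorizer` calls the `tokenizer` callable once per document and expects a list of token strings back. Here is a minimal sketch of that contract using a hypothetical regex-based stand-in for your `spacy_tokenizer` (spaCy itself isn't needed to illustrate it). If your callable returns an empty list for every document, for example because everything gets filtered out as a stop word, you get exactly this empty-vocabulary `ValueError`.

```python
import re

# Hypothetical stand-in for spacy_tokenizer: takes one document (a string)
# and returns a list of token strings, which is what CountVectorizer's
# `tokenizer` parameter expects.
def simple_tokenizer(text):
    return re.findall(r"[a-z0-9']+", text.lower())

# Spot-check the tokenizer on a few documents before fitting the pipeline;
# if these lists come back empty, the vectorizer's vocabulary will be empty too.
docs = ["I love my Echo!", "Great device, works fine."]
for doc in docs:
    print(simple_tokenizer(doc))
```

Running the same kind of loop with your real `spacy_tokenizer` over a few rows of the DataFrame should quickly show whether it is returning tokens at all.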