Text Classification Using spaCy



I'm trying to do some text classification with spaCy, but I'm getting an error that my vocabulary is empty.

I tried a classic dataset and got the same error. I've seen suggestions about splitting the text up, but my data is many short rows, not one very large row.

Here is the code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# spacy_tokenizer and predictors() are custom helpers defined elsewhere
df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)
classifier = LogisticRegression()
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", classifier)])
pipe.fit(X_train, y_train)
--------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-91-b5a14e655d5a> in <module>
10 
11 # Model generation
---> 12 pipe.fit(X_train, y_train)
~\anaconda3\lib\site-packages\sklearn\pipeline.py in fit(self, X, y, **fit_params)
339         """
340         fit_params_steps = self._check_fit_params(**fit_params)
--> 341         Xt = self._fit(X, y, **fit_params_steps)
342         with _print_elapsed_time('Pipeline',
343                                  self._log_message(len(self.steps) - 1)):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params_steps)
301                 cloned_transformer = clone(transformer)
302             # Fit or load from cache the current transformer
--> 303             X, fitted_transformer = fit_transform_one_cached(
304                 cloned_transformer, X, y, None,
305                 message_clsname='Pipeline',
~\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
350 
351     def __call__(self, *args, **kwargs):
--> 352         return self.func(*args, **kwargs)
353 
354     def call_and_shelve(self, *args, **kwargs):
~\anaconda3\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
752     with _print_elapsed_time(message_clsname, message):
753         if hasattr(transformer, 'fit_transform'):
--> 754             res = transformer.fit_transform(X, y, **fit_params)
755         else:
756             res = transformer.fit(X, y, **fit_params).transform(X)
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1200         max_features = self.max_features
1201 
--> 1202         vocabulary, X = self._count_vocab(raw_documents,
1203                                           self.fixed_vocabulary_)
1204 
~\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
1131             vocabulary = dict(vocabulary)
1132             if not vocabulary:
--> 1133                 raise ValueError("empty vocabulary; perhaps the documents only"
1134                                  " contain stop words")
1135 
ValueError: empty vocabulary; perhaps the documents only contain stop words

It looks like you're just using the spaCy tokenizer? I'm not sure exactly what's going on, but you should check your tokenizer's output on your documents.
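A quick way to see whether the tokenizer is the culprit: if a custom `tokenizer` passed to `CountVectorizer` returns no usable tokens, you get exactly this `ValueError`. A minimal sketch (sklearn only, with a hypothetical `broken_tokenizer` standing in for your `spacy_tokenizer`):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is my input text", "another review here"]

# A tokenizer that throws every token away reproduces the error:
def broken_tokenizer(text):
    return []  # e.g. everything got stripped as stop words or punctuation

try:
    CountVectorizer(tokenizer=broken_tokenizer).fit_transform(docs)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # "empty vocabulary; perhaps the documents only contain stop words"

# A tokenizer that returns real token strings fits without trouble:
vec = CountVectorizer(tokenizer=str.split)
vec.fit_transform(docs)
print(sorted(vec.vocabulary_))
```

So before debugging the pipeline, print `spacy_tokenizer(doc)` for a few rows of your data and confirm it returns a non-empty list of strings.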

Note that while I think you can use the tokenizer that way, it's more typical to use a blank pipeline, like:

import spacy
nlp = spacy.blank("en")
words = [tok.text for tok in nlp("this is my input text")]
