sklearn model data transform error: CountVectorizer - Vocabulary wasn't fitted



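I have trained a model for topic classification. When I try to transform new data into vectors for prediction, it fails with "NotFittedError: CountVectorizer - Vocabulary wasn't fitted". However, when I predict by splitting the training data into a test set within the trained model, it works. Here is the code: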

from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
# read new dataset
testdf = pd.read_csv('C://Users/KW198/Documents/topic_model/training_data/testdata.csv', encoding='cp950')
testdf.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1800 entries, 0 to 1799
Data columns (total 2 columns):
keywords    1800 non-null object
topics      1800 non-null int64
dtypes: int64(1), object(1)
memory usage: 28.2+ KB
# read columns
kw = testdf['keywords']
label = testdf['topics']
# convert the prediction data to vectors
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_testkw_vec = vectorizer.transform(kw)

This is the error:

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
<ipython-input-93-cfcc7201e0f8> in <module>()
      1 # convert the prediction data to vectors
      2 vectorizer = CountVectorizer(min_df=1, stop_words='english')
----> 3 x_testkw_vec = vectorizer.transform(kw)
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in transform(self, raw_documents)
    918             self._validate_vocabulary()
    919 
--> 920         self._check_vocabulary()
    921 
    922         # use the same matrix-building strategy as fit_transform
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\feature_extraction\text.py in _check_vocabulary(self)
    301         """Check if vocabulary is empty or missing (not fit-ed)"""
    302         msg = "%(name)s - Vocabulary wasn't fitted."
--> 303         check_is_fitted(self, 'vocabulary_', msg=msg),
    304 
    305         if len(self.vocabulary_) == 0:
~\Anaconda3\envs\ztdl\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
    766 
    767     if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 768         raise NotFittedError(msg % {'name': type(estimator).__name__})
    769 
    770 
NotFittedError: CountVectorizer - Vocabulary wasn't fitted.

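You need to call vectorizer.fit() so that the CountVectorizer builds its vocabulary before you call vectorizer.transform(). You can also call vectorizer.fit_transform(), which combines both steps.

However, you should not use a new vectorizer for testing or any kind of inference. You need to use the same one that was used when training the model. Otherwise the results will be random, because the vocabulary is different (some words are missing, the columns are not aligned the same way, etc.).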
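As a rough sketch of the point above: fit the vectorizer once on the training text, then reuse that same fitted object to transform new text. The toy documents and the names train_docs and new_docs are made up for illustration and are not the asker's data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for the training and new "keywords" columns.
train_docs = ["stock market prices", "football match result", "market crash"]
new_docs = ["football prices", "unseen word market"]

# Fit once on the training text; this builds vectorizer.vocabulary_.
vectorizer = CountVectorizer(min_df=1, stop_words='english')
x_train_vec = vectorizer.fit_transform(train_docs)

# Reuse the SAME fitted vectorizer for new data: transform only, no refit.
# Words that were not seen during fitting are silently ignored.
x_new_vec = vectorizer.transform(new_docs)

# Both matrices have one column per term in the training vocabulary.
print(x_train_vec.shape, x_new_vec.shape)
```

Because both matrices come from the same fitted vocabulary, column i means the same word in each of them, which is what a model trained on x_train_vec expects.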

To do this, simply pickle the vectorizer used during training and load it at inference/test time.
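A minimal sketch of that, assuming the standard-library pickle module is acceptable for your setup. The file name vectorizer.pkl, the temporary directory, and the toy documents are placeholders chosen for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# --- Training time ---
# Hypothetical toy data; in practice this is the training "keywords" column.
train_docs = ["stock market prices", "football match result", "market crash"]
vectorizer = CountVectorizer(min_df=1, stop_words='english')
vectorizer.fit(train_docs)

# Placeholder location; save the vectorizer next to the trained model.
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# --- Inference time ---
# Load the fitted vectorizer instead of constructing a new CountVectorizer.
with open(path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

# transform() now works because vocabulary_ was restored from the file.
x_new_vec = loaded_vectorizer.transform(["football prices"])
print(x_new_vec.shape)
```

joblib.dump and joblib.load can be used the same way. Either way, the saved file should be loaded with a scikit-learn version compatible with the one that wrote it.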
