Problem when deploying a model locally



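I built a model that predicts the type of a website from its text.

But it does not seem to work. I stored the model, the vectorizer and the label encoder in pickle files and load them here.

Code: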

import pandas as pd
import sklearn.metrics as sm
import nltk
import string
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pickle
import os
def clean_text(text):
    #### cleaning the text
    ###1. Convert the text to lower case
    text= text.lower()
    ###2. tokenize the sentences to words
    text_list= word_tokenize(text)
    ###3. Removes the special characters
    special_char_non_text= [re.sub(f'[{string.punctuation}]+','',i) for i in text_list]
    ###4.  remove stopwords
    non_stopwords_text= [i for i in special_char_non_text if i not in stopwords.words('english')]
    ###5. lemmatize the words
    lemmatizer= WordNetLemmatizer()
    lemmatized_words= [lemmatizer.lemmatize(i) for i in non_stopwords_text]
    cleaned_text= ' '.join(lemmatized_words)
    return cleaned_text
text_input= input('Please enter the text: ')
cleaned_text= clean_text(text_input)
temp_df= pd.DataFrame({'input_text':[cleaned_text.strip()]})
vectorizer_filepath= 'tf_idf_vectorizer.pkl'
tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
temp_df_1= tf_idf_vectorizer.transform(temp_df)
input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
### load the model
model_path='multinomial_clf.pkl'
model_clf= pickle.load(open(model_path,'rb'))
y_pred= model_clf.predict(input_df)
#print(y_pred)
### load the label encoder
label_encoder_file= 'label_encoder.pkl'
label_encoder= pickle.load(open(label_encoder_file,'rb'))
label_class= label_encoder.inverse_transform(y_pred.ravel())
print(f'{label_class} is the predicted class')

I get an error:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     65         try:
---> 66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in <listcomp>(.0)
     65         try:
---> 66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:

KeyError: 'website booking flight  bus ticket'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-21-b92cbf8dfe74> in <module>
      5 vectorizer_filepath= 'tf_idf_vectorizer.pkl'
      6 tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
----> 7 temp_df_1= tf_idf_vectorizer.transform(temp_df)
      8 input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
      9 

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
    275             return np.array([])
    276 
--> 277         _, y = _encode(y, uniques=self.classes_, encode=True)
    278         return y
    279 

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
    111     if values.dtype == object:
    112         try:
--> 113             res = _encode_python(values, uniques, encode)
    114         except TypeError:
    115             types = sorted(t.__qualname__

~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
     66             encoded = np.array([table[v] for v in values])
     67         except KeyError as e:
---> 68             raise ValueError("y contains previously unseen labels: %s"
     69                              % str(e))
     70         return uniques, encoded

ValueError: y contains previously unseen labels: 'website booking flight  bus ticket'

The input text I used was: this is a website for booking flights, bus tickets

I don't know why this is happening.

Can anyone help me fix this?

Without the data and the trained model it is impossible to say for certain, but I noticed a few things:

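  1. In ###3, empty strings can be left behind (when a token consists only of punctuation), and nothing removes them afterwards. You call strip() on the whole text, but that only removes leading and trailing whitespace, not runs of two or more spaces inside the text. You can see this in the error message as well.

  2. You pass the whole DataFrame to tf_idf_vectorizer.transform(), but it expects an iterable of documents. Iterating over a DataFrame like this iterates over its columns, not its rows. Try tf_idf_vectorizer.transform(temp_df['input_text']).

  3. You call transform() rather than fit_transform(), so the vocabulary must already be known to the pickled vectorizer. Is that the case?

  4. As far as I know, TfidfVectorizer already has a built-in preprocessor. Did you override it with your clean method in the pickled object? If so, why clean the text manually a second time? The error message shows an untokenized string, which suggests the built-in tokenizer is not running as it should: something tries to look up the untokenized string 'website booking flight bus ticket' in the vocabulary and fails. You should either let TfidfVectorizer do the preprocessing, or use the preprocessor attribute properly and hand it (a modified version of) your cleaning method. See this thread: How to pass a preprocessor to TfidfVectorizer? - sklearn - python.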
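
To illustrate point 1, here is a minimal sketch of dropping tokens that become empty once punctuation is removed. It uses only the standard library and a hand-written token list in place of word_tokenize; the helper name strip_punctuation is made up for this example and is not part of the question's code.

```python
import re
import string

# Hypothetical helper standing in for step ###3 of clean_text.
def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens that end up empty."""
    pattern = re.compile(f"[{re.escape(string.punctuation)}]+")
    stripped = (pattern.sub("", token) for token in tokens)
    # The filter is the important part: a token such as "," becomes ""
    # and would otherwise turn into a double space after ' '.join(...).
    return [token for token in stripped if token]

# Hand-written tokens, roughly what a tokenizer might return for the input.
tokens = ["website", "booking", "flight", ",", "bus", "ticket"]
cleaned = " ".join(strip_punctuation(tokens))
print(repr(cleaned))  # 'website booking flight bus ticket'
```

With the filter in place the joined text contains single spaces only, so the trailing strip() is no longer doing the work of cleaning up whitespace.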

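Points 2 and 4 can be sketched together as follows. This is only an illustration under assumptions: the training corpus is invented, simple_clean is a hypothetical stand-in for clean_text, and the vectorizer is fitted in place rather than loaded from a pickle. Separately, the frames in the traceback are in sklearn's preprocessing/_label.py, which is the module that contains LabelEncoder, so it may be worth checking with type() that each pickle file holds the object you expect.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def simple_clean(text):
    """Hypothetical stand-in for clean_text: lowercase and trim only."""
    return text.lower().strip()

# Toy training corpus, invented for this sketch.
train_docs = [
    "Website for booking flight and bus tickets",
    "Online shop selling shoes and clothes",
    "Daily news and weather reports",
]

# Hand the cleaning function to the vectorizer so that the same
# preprocessing is applied during fit() and transform().
vectorizer = TfidfVectorizer(preprocessor=simple_clean)
vectorizer.fit(train_docs)

temp_df = pd.DataFrame(
    {"input_text": ["This is a website for booking flights, bus tickets"]}
)

# Iterating over a DataFrame yields its column names, not its rows.
print(list(temp_df))  # ['input_text']

# Pass the column (one document per row) instead of the whole DataFrame.
features = vectorizer.transform(temp_df["input_text"])

# Sanity check on the object type, useful after pickle.load() as well.
print(type(vectorizer).__name__, features.shape)
```

Words that were not seen during fit() are simply ignored by transform(), so the resulting row has one column per vocabulary entry regardless of the input text.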