I built a model that predicts the type of a website by looking at its text.
It doesn't seem to work, though. I have stored the model, the vectorizer and the label encoder in pickle files and load them here.
Code:
import pandas as pd
import sklearn.metrics as sm
import nltk
import string
from nltk.tokenize import word_tokenize
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import pickle
import os
def clean_text(text):
    #### cleaning the text
    ### 1. Convert the text to lower case
    text = text.lower()
    ### 2. Tokenize the sentences into words
    text_list = word_tokenize(text)
    ### 3. Remove the special characters
    special_char_non_text = [re.sub(f'[{string.punctuation}]+', '', i) for i in text_list]
    ### 4. Remove stopwords
    non_stopwords_text = [i for i in special_char_non_text if i not in stopwords.words('english')]
    ### 5. Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(i) for i in non_stopwords_text]
    cleaned_text = ' '.join(lemmatized_words)
    return cleaned_text
text_input= input('Please enter the text: ')
cleaned_text= clean_text(text_input)
temp_df= pd.DataFrame({'input_text':[cleaned_text.strip()]})
vectorizer_filepath= 'tf_idf_vectorizer.pkl'
tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
temp_df_1= tf_idf_vectorizer.transform(temp_df)
input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
### load the model
model_path='multinomial_clf.pkl'
model_clf= pickle.load(open(model_path,'rb'))
y_pred= model_clf.predict(input_df)
#print(y_pred)
### load the label encoder
label_encoder_file= 'label_encoder.pkl'
label_encoder= pickle.load(open(label_encoder_file,'rb'))
label_class= label_encoder.inverse_transform(y_pred.ravel())
print(f'{label_class} is the predicted class')
I get this error:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in <listcomp>(.0)
65 try:
---> 66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
KeyError: 'website booking flight bus ticket'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-21-b92cbf8dfe74> in <module>
5 vectorizer_filepath= 'tf_idf_vectorizer.pkl'
6 tf_idf_vectorizer= pickle.load(open(vectorizer_filepath,'rb'))
----> 7 temp_df_1= tf_idf_vectorizer.transform(temp_df)
8 input_df= pd.DataFrame(temp_df_1.toarray(),columns=tf_idf_vectorizer.get_feature_names())
9
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in transform(self, y)
275 return np.array([])
276
--> 277 _, y = _encode(y, uniques=self.classes_, encode=True)
278 return y
279
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode(values, uniques, encode, check_unknown)
111 if values.dtype == object:
112 try:
--> 113 res = _encode_python(values, uniques, encode)
114 except TypeError:
115 types = sorted(t.__qualname__
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in _encode_python(values, uniques, encode)
66 encoded = np.array([table[v] for v in values])
67 except KeyError as e:
---> 68 raise ValueError("y contains previously unseen labels: %s"
69 % str(e))
70 return uniques, encoded
ValueError: y contains previously unseen labels: 'website booking flight bus ticket'
The input text I used was: "this is a website for booking flight, bus tickets".
I don't understand why this is happening.
Can someone help me fix this?
Without the data and the trained model it is impossible to say for sure, but a few things stand out:
- In step ###3, empty strings can be left behind (when a token consists only of punctuation characters), and they are never removed afterwards. You do call strip() on the whole text, but that only removes leading and trailing whitespace, not the runs of two or more spaces that can appear inside the text. You can also see this in the error message.
- You hand the whole DataFrame to tf_idf_vectorizer.transform(), but it expects an iterable of documents. Iterating over a DataFrame like that iterates over the columns, not the rows. Try tf_idf_vectorizer.transform(temp_df['input_text']) instead.
- You call transform() instead of fit_transform(), so the entire vocabulary has to be known to the model already. Is that the case?
- As far as I know, TfidfVectorizer already has a built-in preprocessor. Did you override it with your clean method when the pickled object was fitted? If so, why clean manually again here? The error message shows an untokenized string, which suggests the built-in tokenizer did not run the way it should: it tried to look up a vector for the untokenized string 'website booking flight bus ticket' in the vocabulary and failed. You should either let TfidfVectorizer do the preprocessing, or use its preprocessor attribute properly and hand it a (modified version of) your cleaning method. See this thread: "How to pass a preprocessor to TfidfVectorizer? - sklearn - python".
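To illustrate the second point, here is a minimal sketch with a toy corpus standing in for the pickled vectorizer (the question's pickle files and training data are not available, so the corpus and variable names here are made up):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real training data.
corpus = [
    "website for booking flight and bus tickets",
    "blog about travel photography",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

temp_df = pd.DataFrame({'input_text': ["website booking flight bus ticket"]})

# Iterating over a DataFrame yields its column *names*, so
# vectorizer.transform(temp_df) would try to vectorize the
# string 'input_text' rather than your document. Passing the
# column (a Series of documents) gives one row per document:
X = vectorizer.transform(temp_df['input_text'])
print(X.shape[0])  # 1 — one document, one row
```

Note that 'ticket' (singular) does not match the fitted vocabulary entry 'tickets', so it simply contributes nothing to the vector; unknown words are ignored by transform(), which is exactly why the vectorizer and the prediction-time cleaning must agree.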