Predicting unseen data with a previously trained model



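I am doing supervised machine learning with scikit-learn. I have two datasets. The first contains data with X features and Y labels. The second contains only X features, with no Y labels. I can successfully fit LinearSVC on a train/test split of the first dataset and get Y labels for the test portion.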

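Now I want to use the model I trained on the first dataset to predict labels for the second dataset. How do I apply a model pre-trained on the first dataset to the second dataset (which has no labels) in scikit-learn?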

The code snippet I tried (updated based on the comments below):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle

# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
'Harry potter book is awesome. It rocks',
'Nutrition is very important',
'Welcome to library, you can find as many book as you like',
'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]
# books = 1 : y label
# food = 0 : y label
df = pd.DataFrame({'text':some_text,
'y_variable': y_variable
})
# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape

# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
labels,
train_size=0.5,
random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))

# ----------- Dataset 2: UNSEEN DATASET ----------- #
some_text2 = ['Harry potter books are amazing',
'Gluten free diet is getting popular']
unseen_df = pd.DataFrame({'text':some_text2}) # Notice this doesn't have y_variable. This is the data set whose y_variable labels (1 or 0) I am trying to predict.

# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26

print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)

# Desired output for the UNSEEN, unlabeled data after prediction:
# text                                   y_variable
# Harry potter books are amazing         1
# Gluten free diet is getting popular    0

It doesn't have to use pickle as in my attempt above. I'm looking for any suggestions, or for a prebuilt scikit-learn function that can make these predictions.

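As you can see, your first tfidf turns the input into 26 features, while your second tfidf turns it into 11 features. The error arises because the two feature matrices have different numbers of columns. The message is telling you that each observation in X_unseen has fewer features than the number model was trained to receive.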
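To make the mismatch concrete, here is a minimal sketch. The two short text lists are illustrative stand-ins, not the data from the question. It compares the column count from a vectorizer fitted on the training text with the count from a fresh vectorizer fitted on the unseen text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sample texts (assumed for this sketch, not the question's data)
train_text = ['Books are amazing', 'Nutrition is very important']
unseen_text = ['Harry potter books are amazing']

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_text)  # learns the vocabulary from the training text
n_train_features = X_train.shape[1]

# transform() reuses the already-fitted vocabulary, so the column count matches
X_unseen_ok = tfidf.transform(unseen_text)

# A freshly fitted vectorizer learns a different vocabulary from the unseen text,
# so its column count generally differs from what the model was trained on
X_unseen_bad = TfidfVectorizer().fit_transform(unseen_text)

print(n_train_features, X_unseen_ok.shape[1], X_unseen_bad.shape[1])
```

Only the matrix produced by `transform` on the original vectorizer has the same number of columns as the training matrix, which is what the model needs.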

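After loading model in the second script, you fit another vectorizer to the text. In other words, tfidf from the first script and tfidf from the second script are different objects. To make predictions with model, you need to transform X_unseen using the original tfidf. To do that, export the original vectorizer, load it in the new script, and use it to transform the new data before passing it to model.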

### Do this in the first program
# Dump both the model and the fitted tfidf vectorizer
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('tfidf.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

### Do this in the second program
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
with open('tfidf.pkl', 'rb') as f:
    tfidf = pickle.load(f)
# Use `transform` instead of `fit_transform` so the training vocabulary is reused
X_unseen = tfidf.transform(unseen_df['text']).toarray()
# Predict on `X_unseen` with the loaded model
y_pred_unseen = model.predict(X_unseen)
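One way to avoid keeping two pickle files in sync is to bundle the vectorizer and the classifier into a single scikit-learn pipeline and persist that one object. The following is a sketch under assumptions, not the answerer's method: the file name `text_clf.pkl` is hypothetical, and a temporary directory stands in for wherever you would actually store the file.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Sample data from the question: books = 1, food = 0
train_text = ['Books are amazing',
              'Harry potter book is awesome. It rocks',
              'Nutrition is very important',
              'Welcome to library, you can find as many book as you like',
              'Food like brocolli has many advantages']
train_labels = [1, 1, 0, 1, 0]

# The pipeline fits the vectorizer and the classifier together on raw text
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(train_text, train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'text_clf.pkl')  # hypothetical file name

    # "First program": persist the whole fitted pipeline
    with open(path, 'wb') as f:
        pickle.dump(pipeline, f)

    # "Second program": load it back
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

# predict() takes raw text; the stored vectorizer applies the training vocabulary
unseen_text = ['Harry potter books are amazing',
               'Gluten free diet is getting popular']
y_pred_unseen = loaded.predict(unseen_text)

for text, label in zip(unseen_text, y_pred_unseen):
    print(text, label)
```

Because the loaded pipeline carries its own fitted vectorizer, there is no separate `fit_transform` call on the unseen data to get wrong.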

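Imagine you train an AI to recognize airplanes using photos of the engines, the wheels, the wings, and the pilot's bow tie. Now you call the same AI and ask it to predict the airplane model from the bow tie alone. That is what scikit-learn is telling you: X_unseen has far fewer features (= columns) than X_train or X_test.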

Ignore the second dataset and use train_test_split to create the test set.
