sklearn model.predict raises a shape error after splitting with kf.split



I am trying to train a model on text strings with sklearn and predict with it. The code is below:

from sklearn import datasets
news = datasets.load_files("dataset-news", encoding='latin1', categories=categories)

def vectorize_data(data):
    count_vect = CountVectorizer()
    return count_vect.fit_transform(data)

# Gaussian naive Bayes
def gaussian_train(train, target):
    gnb = GaussianNB()
    gnb.fit(train, target)
    return gnb

kf = KFold(n_splits=5)
counter = 1
for train_idx, test_idx in kf.split(news.data):
    print("%d Fold" % counter)
    train_data = vectorize_data(np.array(news.data)[train_idx])
    test_data = vectorize_data(np.array(news.data)[test_idx])
    print("Gaussian naive Bayes")
    print(train_data.shape)
    print(test_data.shape)
    g_model_train = gaussian_train(train_data.toarray(), news.target[train_idx])
    # predict_data(g_model_fold, test_data.toarray(), target_data)
    # Predict unseen test data based on fitted classifier
    predicted = g_model_fold.predict(test_data.toarray())

From my console:

1 Fold
Gaussian naive Bayes
(640, 13477)
(161, 5193)

But then I get:

ValueError: operands could not be broadcast together with shapes (161,5193) (14214,) 

How can I fix this?

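When you convert text to token counts, the same features (vocabulary) must be used for every matrix so that they all have the same number of columns. In your code, CountVectorizer is fitted separately on the training and test splits, so each split gets its own vocabulary and a different column count. One option is to return the count vectorizer fitted on the training data and reuse it to transform the test data. So we set up the vectorize_data() function like this: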

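import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

def vectorize_data(data):
    # Fit only; return the fitted vectorizer so it can be reused on other splits
    count_vect = CountVectorizer()
    return count_vect.fit(data)

# Same helper as in the question
def gaussian_train(train, target):
    return GaussianNB().fit(train, target)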
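
To see why the column counts diverge, here is a minimal sketch on made-up toy sentences (train_docs and test_docs are illustrative and not part of the original dataset). It compares fitting a fresh CountVectorizer on each split against fitting once on the training split and reusing it:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only to illustrate the shape mismatch
train_docs = ["the cat sat", "the dog barked loudly"]
test_docs = ["a cat barked", "unseen words only"]

# A fresh vectorizer per split learns a separate vocabulary for each matrix
separate_train = CountVectorizer().fit_transform(train_docs)
separate_test = CountVectorizer().fit_transform(test_docs)

# Fitting once on the training split and reusing it keeps the columns aligned
shared = CountVectorizer().fit(train_docs)
shared_train = shared.transform(train_docs)
shared_test = shared.transform(test_docs)

print(separate_train.shape, separate_test.shape)  # column counts differ
print(shared_train.shape, shared_test.shape)      # column counts match
```

Words that appear only in the test split are simply ignored by the shared vectorizer, which is what you want: the classifier has no parameters for features it never saw during training.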

Using an example dataset:

categories = ['alt.atheism', 'sci.space']
news = datasets.fetch_20newsgroups(categories=categories)

Run the k-fold loop:

kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(news.data):
    # Fit the vectorizer on the training fold only, then reuse it for both folds
    cvect = vectorize_data(np.array(news.data)[train_idx])
    train_data = cvect.transform(np.array(news.data)[train_idx])
    test_data = cvect.transform(np.array(news.data)[test_idx])
    print("Gaussian naive Bayes")
    print(train_data.shape)
    print(test_data.shape)
    g_model_train = gaussian_train(train_data.toarray(), news.target[train_idx])
    # predict_data(g_model_fold, test_data.toarray(), target_data)
    # Predict unseen test data based on fitted classifier
    predicted = g_model_train.predict(test_data.toarray())

Output:

Gaussian naive Bayes
(858, 20415)
(215, 20415)
Gaussian naive Bayes
(858, 20019)
(215, 20019)
Gaussian naive Bayes
(858, 20094)
(215, 20094)
Gaussian naive Bayes
(859, 20119)
(214, 20119)
Gaussian naive Bayes
(859, 20207)
(214, 20207)
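
As an alternative that is not part of the answer above, the same "fit on train, transform on test" rule can be delegated to a scikit-learn Pipeline, which refits the vectorizer inside every fold. The following is only a sketch under assumptions: the docs and labels are made-up toy data, and to_dense is a hypothetical helper added because GaussianNB expects dense input.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def to_dense(matrix):
    # Hypothetical helper: GaussianNB needs a dense array, not a sparse matrix
    return matrix.toarray()

# Hypothetical toy corpus; labels alternate so every fold sees both classes
docs = [
    "rocket launch orbit", "faith belief debate",
    "satellite orbit moon", "belief argument faith",
    "moon mission rocket", "debate argument belief",
    "orbit mission launch", "faith debate argument",
]
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = Pipeline([
    ("counts", CountVectorizer()),             # fitted on each training fold only
    ("dense", FunctionTransformer(to_dense)),  # sparse -> dense for GaussianNB
    ("gnb", GaussianNB()),
])

# cross_val_score clones and refits the whole pipeline for every fold
scores = cross_val_score(model, docs, labels, cv=KFold(n_splits=4))
print(scores)
```

Because the vectorizer lives inside the pipeline, there is no way to fit it on the test fold by accident, at the cost of hiding the intermediate matrix shapes that the explicit loop prints.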
