Unable to evaluate scores using decision_function() in LogisticRegression



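I am working on this University of Washington assignment, in which I have to predict the scores for sample_test_matrix (the last few lines of the code) using decision_function() in LogisticRegression. But I get this error: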

ValueError: X has 145 features per sample; expecting 113092

Here is the code:

import string
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

products = pd.read_csv('amazon_baby.csv')

def remove_punct(text):
    # Strip every punctuation character from the review text
    text = str(text)
    for i in string.punctuation:
        text = text.replace(i, "")
    return text

products['review_clean'] = products['review'].apply(remove_punct)
products = products[products.rating != 3]
products['sentiment'] = products['rating'].apply(lambda x: +1 if x > 3 else -1)

train_data_index = pd.read_json('module-2-assignment-train-idx.json')
test_data_index = pd.read_json('module-2-assignment-test-idx.json')
train_data = products.loc[train_data_index[0], :]
test_data = products.loc[test_data_index[0], :]
train_data = train_data.dropna()
test_data = test_data.dropna()

# The vectorizer definition was not shown in the original post; default settings are assumed here
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.fit_transform(test_data['review_clean'])

sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])
print(sentiment_model.coef_)

sample_data = test_data[10:13]
print(sample_data)
sample_test_matrix = vectorizer.transform(sample_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

Here is the product data:

   Name                                               Review                                              Rating
0  Planetwise Flannel Wipes                           These flannel wipes are OK, but in my opinion ...   3
1  Planetwise Wipe Pouch                              it came early and was not disappointed. i love...   5
2  Annas Dream Full Quilt with 2 Shams                Very soft and comfortable and warmer than it l...   5
3  Stop Pacifier Sucking without tears with Thumb...  This is a product well worth the purchase.  I ...   5
4  Stop Pacifier Sucking without tears with Thumb...  All of my kids have cried non-stop when I trie...   5

This line is what causes the error in the later lines:

test_matrix = vectorizer.fit_transform(test_data['review_clean'])

Change it to:

test_matrix = vectorizer.transform(test_data['review_clean'])

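Explanation: calling fit_transform() refits the CountVectorizer on the test data. Everything it learned from the training data is lost, and the vocabulary is rebuilt from the test data alone.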

That same vectorizer object is then used to transform sample_data['review_clean'], so the resulting features are only the ones learned from test_data.

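sentiment_model, however, was trained on the vocabulary of train_data. The two feature sets therefore differ, and decision_function() raises the ValueError about the number of features.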
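The effect of refitting can be seen on a toy corpus. The following is only a minimal sketch: the document lists and variable names are made up for illustration and are not from the assignment data, and CountVectorizer is used with its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, standing in for train_data and test_data
train_docs = ["great product love it", "terrible waste of money", "love this stroller"]
test_docs = ["love it", "broke after a week"]

vectorizer = CountVectorizer()

# Fitting on the training documents learns the training vocabulary
train_matrix = vectorizer.fit_transform(train_docs)
n_train_features = train_matrix.shape[1]

# Calling fit_transform again discards that vocabulary and learns a new one
refit_matrix = vectorizer.fit_transform(test_docs)
n_refit_features = refit_matrix.shape[1]

print("features after fitting on train:", n_train_features)
print("features after refitting on test:", n_refit_features)
```

The two counts differ because each call to fit_transform() builds the vocabulary from scratch, so a model trained on the first matrix cannot score rows produced after the second call.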

Always use transform() on test data, never fit_transform().
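As a rough illustration of that rule, here is a small self-contained sketch of the fit-once, transform-afterwards pattern. The toy documents and labels are invented for the example; only CountVectorizer, LogisticRegression, and decision_function() come from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, not the Amazon baby reviews
train_docs = [
    "love this product",
    "great quality love it",
    "terrible broke quickly",
    "waste of money terrible",
]
train_labels = [1, 1, -1, -1]
test_docs = ["love the quality", "terrible waste"]

vectorizer = CountVectorizer()
# fit_transform only on the training data: this learns the vocabulary
train_matrix = vectorizer.fit_transform(train_docs)
# transform only on the test data: this reuses the training vocabulary
test_matrix = vectorizer.transform(test_docs)

model = LogisticRegression()
model.fit(train_matrix, train_labels)

# Both matrices have the same number of columns, so scoring works
scores = model.decision_function(test_matrix)
print(scores)
```

Words in the test documents that never appeared in training (here "the") are simply ignored by transform(), which keeps the column count identical to what the model expects.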
