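I have a dataset in the following format-
Movie_Name, TomatoCritics, Target_Variable
Here, the TomatoCritics attribute contains free-text reviews from different users for different movies. Target_Variable is a binary value (0 or 1) indicating whether the movie should be watched.
I am using TF-IDF to approach this problem, and my code is as follows-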
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
# Read textual training data-
text_training = pd.read_csv("Textual-Training_Data.csv")
# Read textual testing data-
text_testing = pd.read_csv("Textual-Testing_Data.csv")
# Get dimensions of training data-
text_training.shape
# (95, 3)
# Get dimensions of testing data-
text_testing.shape
# (224, 3)
# Check for missing values in training data-
text_training.isnull().values.any()
# True
# Check for missing values in testing data-
text_testing.isnull().values.any()
# True
# Remove any row having missing value from training data-
text_training_nona = text_training.dropna(axis = 0, how='any')
# Remove any row having missing value from testing data-
text_testing_nona = text_testing.dropna(axis = 0, how = 'any')
# Get dimensions of training data AFTER removing empty rows-
text_training_nona.shape
# (73, 3)
# Get dimensions of testing data AFTER removing empty rows-
text_testing_nona.shape
# (158, 3)
# Attributes to use for training and testing sets for ML-
cols_train = ['tomatoConsensus', 'goodforairplanes']
cols_test = ['tomatoConsensus', 'goodforairplanes']
# Split training dataset into features (X) and label (y) for training-
X_train = text_training_nona['tomatoConsensus']
y_train = text_training_nona['goodforairplanes']
# Split testing dataset into features (X) and label (y) for testing-
X_test = text_testing_nona["tomatoConsensus"]
y_test = text_testing_nona['goodforairplanes']
# Initialize Count Vectorizer using TF-IDF ->
cv = TfidfVectorizer(min_df = 1, stop_words='english')
# Convert text to TF-IDF ->
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)
# Multinomial Naive Bayes classifier-
mnb = MultinomialNB()
# Train model on training data-
mnb.fit(X_train_cv, y_train)
print(X_test_cv[0])
'''
(0, 1168) 0.20066499253877468
(0, 31) 0.2419027475877309
(0, 1090) 0.22790133982975397
(0, 5) 0.2616366234663056
(0, 877) 0.2616366234663056
(0, 1279) 0.2419027475877309
(0, 850) 0.1786670002268731
(0, 1341) 0.2616366234663056
(0, 2) 0.2616366234663056
(0, 695) 0.2616366234663056
(0, 1221) 0.2419027475877309
(0, 884) 0.1786670002268731
(0, 1070) 0.2616366234663056
(0, 782) 0.2616366234663056
(0, 252) 0.20066499253877468
(0, 1259) 0.2419027475877309
(0, 1093) 0.20816746395117927
(0, 122) 0.2170410042381541
'''
y_pred = mnb.predict(X_test_cv[0])
The last line, which calls mnb.predict(), gives the error-
ValueError: dimension mismatch
What is going wrong?
Thanks!
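You need to call fit_transform only once, on the training data, and then use the existing, already-fitted cv object to transform the test data. Change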
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.fit_transform(X_test)
to
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
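This should solve your problem.
If you call fit_transform again on other data, that data may contain a different number of unique words, so the vectorizer builds a vocabulary of a different size. The mnb model was trained on features from the first vocabulary, so its expected number of dimensions no longer matches the size of the new vocabulary. That is the ValueError: dimension mismatch.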
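To make the vocabulary explanation concrete, here is a minimal sketch that reproduces the mismatch on toy data. The sentences and labels are made up for illustration and merely stand in for the tomatoConsensus and goodforairplanes columns; the exact error text may differ between scikit-learn versions, but it should be a ValueError in each case.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from the original CSV files.
train_docs = [
    "a gripping thriller with sharp dialogue",
    "dull plot and flat acting",
    "sharp funny and heartfelt comedy",
]
train_labels = [1, 0, 1]
test_docs = ["flat dialogue but a heartfelt finale"]

# Fit the vectorizer on the training text and train the classifier.
cv = TfidfVectorizer(stop_words="english")
X_train_cv = cv.fit_transform(train_docs)
mnb = MultinomialNB().fit(X_train_cv, train_labels)

# Fitting a vectorizer on the test text builds a brand-new vocabulary.
refit = TfidfVectorizer(stop_words="english")
X_test_refit = refit.fit_transform(test_docs)

train_vocab_size = len(cv.vocabulary_)
test_vocab_size = len(refit.vocabulary_)
print("train vocabulary:", train_vocab_size, "test vocabulary:", test_vocab_size)

# The model expects train_vocab_size columns, so prediction fails.
try:
    mnb.predict(X_test_refit)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__, mismatch_error)
```

The two vocabulary sizes differ because each fit only sees the words in the text it was given, and the classifier has no way to map the new columns back to the ones it was trained on.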
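EDIT
Just check the shapes of X_test_cv and X_train_cv in both cases: if you call fit_transform on both X_train and X_test, the two matrices will have a different number of columns, but if you replace the second fit_transform with transform, the number of columns will be the same.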
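The shape check can be sketched roughly as follows, again on made-up toy sentences rather than the real data. The variable names bad_train and bad_test are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from the original CSV files.
train_docs = [
    "a gripping thriller with sharp dialogue",
    "dull plot and flat acting",
    "sharp funny and heartfelt comedy",
]
train_labels = [1, 0, 1]
test_docs = [
    "flat dialogue but a heartfelt finale",
    "gripping comedy",
]

# Case 1: fit_transform on both sets (the second call refits the vectorizer).
cv_refit = TfidfVectorizer(stop_words="english")
bad_train = cv_refit.fit_transform(train_docs)
bad_test = cv_refit.fit_transform(test_docs)
print("fit_transform twice:", bad_train.shape, bad_test.shape)

# Case 2: fit_transform on the training set, transform on the test set.
cv = TfidfVectorizer(stop_words="english")
X_train_cv = cv.fit_transform(train_docs)
X_test_cv = cv.transform(test_docs)
print("fit_transform + transform:", X_train_cv.shape, X_test_cv.shape)

# With matching column counts the classifier can predict on the test rows.
mnb = MultinomialNB().fit(X_train_cv, train_labels)
y_pred = mnb.predict(X_test_cv)
print(y_pred)
```

The row counts still differ between the training and test matrices, which is expected; only the second element of each shape (the number of features) has to match.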