ValueError: X has 5851 features per sample; expecting 2754 when applying a LinearSVC model to the test set



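I'm trying to classify text with LinearSVC, but I get an error.

I applied the model to the test set as shown below. In this code I build the Tfidf features and oversample the training set.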

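# Imports inferred from the names used below
import pandas as pd
import texthero as hero
from texthero import preprocessing
from imblearn.over_sampling import RandomOverSampler
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#Import datasets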
train = pd.read_csv('train_labeled.csv')
test = pd.read_csv('test.csv')
#Clean datasets
custom_pipeline = [preprocessing.fillna,
preprocessing.lowercase,
preprocessing.remove_whitespace,
preprocessing.remove_punctuation,
preprocessing.remove_urls,
preprocessing.remove_digits,
preprocessing.stem  
]

train["clean_text"] = train["text"].pipe(hero.clean, custom_pipeline)
test["clean_text"] = test["text"].pipe(hero.clean, custom_pipeline)
#Create Tfidf
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train["clean_text"])
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_test_counts = count_vect.fit_transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.fit_transform(X_test_counts)
#Oversampling of training set
over = RandomOverSampler(sampling_strategy='minority')
X_os, y_os = over.fit_resample(X_train_tfidf, train["label"])
#Model
clf = svm.LinearSVC(C=1.0, penalty='l2', loss='squared_hinge', dual=True, tol=1e-3)
clf.fit(X_os, y_os)
pred = clf.predict(X_test_tfidf)

I get the error below. I think it is because the test set has 5851 samples while the training set has 2754 samples.

ValueError: X has 5851 features per sample; expecting 2754

What should I do in this case?

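Don't call fit_transform() on the test data: the transformers will learn a new vocabulary and will not transform the test data the same way they transformed the training data. The two numbers in the error are feature counts (vocabulary sizes), not sample counts: refitting on the test text produced 5851 features, while the classifier was trained on 2754. To use the same vocabulary as the training data, call only transform() on the test data: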

# initialize transformers
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
# fit and transform train data
X_train_counts = count_vect.fit_transform(train["clean_text"])
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# transform test data
X_test_counts = count_vect.transform(test["clean_text"])
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
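
To see where the mismatch comes from, here is a minimal sketch on made-up toy sentences (the sentences and variable names are illustrative assumptions, not the question's data). It compares the number of columns produced by transform() with the number produced by refitting a vectorizer on the test text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration only
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["a bird sang loudly in the tree"]

count_vect = CountVectorizer()
X_train = count_vect.fit_transform(train_docs)
n_train_features = X_train.shape[1]

# Correct: reuse the vocabulary learned from the training data
X_test_ok = count_vect.transform(test_docs)

# Wrong: a fresh fit learns a different vocabulary from the test text
X_test_bad = CountVectorizer().fit_transform(test_docs)

print("train features:        ", n_train_features)
print("test via transform():  ", X_test_ok.shape[1])
print("test via refit:        ", X_test_bad.shape[1])
```

Only the transform() result has the same number of columns as the training matrix, which is what a classifier fitted on that matrix expects. Words that appear only in the test text are simply ignored.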

Note

If you don't need the output of CountVectorizer, you can use TfidfVectorizer instead to write less code:

tfidf_vect = TfidfVectorizer()
X_train_tfidf = tfidf_vect.fit_transform(train["clean_text"])
X_test_tfidf = tfidf_vect.transform(test["clean_text"])
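
One way to make this kind of mistake harder to repeat is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that fit() and predict() call fit_transform() and transform() on the right data for you. The following is only a sketch on invented toy data, and it leaves out the oversampling step: scikit-learn's own Pipeline does not accept samplers, so with RandomOverSampler you would likely want the Pipeline class from imbalanced-learn instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
train_docs = ["good movie", "great film", "bad movie", "awful film"]
train_labels = [1, 1, 0, 0]
test_docs = ["great movie", "awful plot"]

# The pipeline fits the vectorizer on the training text only
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# predict() applies transform() (not fit_transform()) to the test text
pred = model.predict(test_docs)
print(pred)
```

With this layout there is no separate test matrix to build by hand, so the vocabulary used at prediction time is always the one learned during fit().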
