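I am trying to build a text classifier with scikit-learn using a bag-of-words model: the text is vectorized and the result is fed into a classifier. However, I would like to know how to add another variable to the input besides the text itself. Say I want to add the number of words in the text as an extra feature (because I think it may affect the result). How should I go about that?
Do I have to stack another classifier on top of that one, or is there a way to add the extra input to the vectorized text?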
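scikit-learn classifiers work with NumPy arrays. This means that after vectorizing the text you can easily add new features to that array (I take that back: it is not that easy, but it is doable). The catch is that in text classification your features are sparse: the vectorizer returns a SciPy sparse matrix, so the usual NumPy way of appending a column does not work. The extra column has to be converted to a sparse matrix and joined with scipy.sparse.hstack instead.
The code is adapted from the text mining example in the scikit-learn tutorial given at SciPy 2013.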
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import numpy as np
import scipy.sparse

# Newsgroups to load. Assumption: the four-category subset used in the
# tutorial; the original snippet does not show this definition.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the text data
twenty_train_subset = load_files('datasets/20news-bydate-train/',
    categories=categories, encoding='latin-1')

# Turn the text documents into sparse TF-IDF vectors
vectorizer = TfidfVectorizer(min_df=2)
X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)
print(type(X_train_only_text_features))
print("X_train_only_text_features", X_train_only_text_features.shape)

# Number of documents (rows)
size = X_train_only_text_features.shape[0]
print("size", size)

# Extra feature: a dense column with one value per document (all ones here)
ones_column = np.ones((size, 1))
print("ones_column", ones_column.shape)

# Convert the dense column to a sparse matrix so it can be stacked
new_column = scipy.sparse.csr_matrix(ones_column)
print(type(new_column))
print("new_column", new_column.shape)

# Join the new column and the text features side by side
X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])
print("X_train", X_train.shape)
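The output is as follows (recorded with Python 2; under Python 3 the shapes print without the L suffix, and the printed class path depends on the SciPy version):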
<class 'scipy.sparse.csr.csr_matrix'>
X_train_only_text_features (2034, 17566)
size 2034
ones_column (2034L, 1L)
<class 'scipy.sparse.csr.csr_matrix'>
new_column (2034, 1)
X_train (2034, 17567)
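To show the same idea end to end, here is a minimal sketch that appends one extra numeric column to the TF-IDF features and then trains a classifier on the combined matrix. The toy documents, the labels and the values in `extra` are made up for illustration and are not from the original answer. Note that `MultinomialNB` expects non-negative features, so the extra column is assumed to be non-negative as well.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "the rocket reached orbit",
    "the shuttle launch was delayed",
    "new graphics card renders faster",
    "the shader renders smooth graphics",
]
labels = np.array([0, 0, 1, 1])

# Hypothetical extra feature: one non-negative number per document.
extra = np.array([[3.0], [1.0], [0.0], [2.0]])

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Stack the sparse text features and the extra column side by side.
# format="csr" asks for a CSR result instead of the default COO format.
X_train = sp.hstack([X_text, sp.csr_matrix(extra)], format="csr")

clf = MultinomialNB().fit(X_train, labels)

# At prediction time the extra column must be appended in the same position.
new_docs = ["the rocket launch"]
new_extra = np.array([[2.0]])
X_new = sp.hstack(
    [vectorizer.transform(new_docs), sp.csr_matrix(new_extra)], format="csr"
)
pred = clf.predict(X_new)
print(X_train.shape, pred.shape)
```

The important design point is that the combined matrix stays sparse from start to finish, and that training and prediction build their feature matrices with the same column order.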