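In scikit-learn, how do I run HashingVectorizer on data that already exists in a scipy.sparse matrix?
My data is in svmlight format, so I load it with sklearn.datasets.load_svmlight_file and get a scipy.sparse matrix.
scikit-learn's TfidfTransformer can be fed such a sparse matrix to transform it, but how can I give the same sparse matrix to HashingVectorizer?
EDIT: Is there perhaps a sequence of method calls that can be applied to the sparse matrix, maybe using FeatureHasher?
EDIT 2: After a helpful discussion with user cfh below, my goal is to go from input (a sparse count matrix obtained from svmlight data) to output (a matrix of token occurrences, such as HashingVectorizer gives). How can this be done?
I have provided sample code below and would really appreciate any help with this. Thanks in advance: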
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import csr_matrix
# example data
X_train = np.array([[1., 1.], [2., 3.], [4., 0.]])
print("X_train:\n", X_train)
# transform to scipy.sparse.csr_matrix to be consistent with output from load_svmlight_file
X_train_crs = csr_matrix(X_train)
print("X_train_crs:\n", X_train_crs)
# no problem to run TfidfTransformer() on this csr matrix to get a transformed csr matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_train_crs)
print("tfidf:\n", tfidf)
# How do I use the HashingVectorizer with X_train_crs ?
hv = HashingVectorizer(n_features=2)
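Hashing basically combines words at random into a smaller number of buckets. For a frequency matrix that has already been computed, you can simulate this as follows: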
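n_features = X_train.shape[1]
# integer division, with at least one bucket even for very few features
n_desired_features = max(1, n_features // 5)
# assign each original feature to a random bucket in [0, n_desired_features)
buckets = np.random.randint(0, n_desired_features, size=n_features)
X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    # add column i of the original matrix to its bucket column
    X_new[:, buckets[i]] += X_train[:, i]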
Of course, you can adjust n_desired_features as you like. Just make sure to use the same buckets for the test data as well.
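To illustrate reusing the same buckets for training and test data, here is a minimal sketch. The helper names make_buckets and bucket_dense, the fixed seed, and the toy matrices are assumptions made for this example and are not part of scikit-learn.

```python
import numpy as np

def make_buckets(n_features, n_buckets, seed=0):
    """Randomly assign each original feature index to one of n_buckets (hypothetical helper)."""
    rng = np.random.RandomState(seed)  # fixed seed so the assignment is reproducible
    return rng.randint(0, n_buckets, size=n_features)

def bucket_dense(X, buckets, n_buckets):
    """Sum the columns of a dense matrix X that share a bucket (hypothetical helper)."""
    X_new = np.zeros((X.shape[0], n_buckets), dtype=X.dtype)
    for i, b in enumerate(buckets):
        X_new[:, b] += X[:, i]
    return X_new

# toy count matrices with 6 original features (made-up data)
X_train = np.array([[1., 1., 0., 2., 0., 1.],
                    [2., 3., 1., 0., 0., 0.]])
X_test = np.array([[0., 1., 1., 1., 4., 0.]])

n_buckets = 3
buckets = make_buckets(X_train.shape[1], n_buckets)

# the same bucket assignment is applied to both matrices
X_train_new = bucket_dense(X_train, buckets, n_buckets)
X_test_new = bucket_dense(X_test, buckets, n_buckets)
print(X_train_new)
print(X_test_new)
```

Because the bucket assignment is generated once and then passed around, a given column of the reduced train and test matrices always refers to the same group of original features.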
If you need to do the same thing with a sparse matrix, you can do it like this:
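from scipy.sparse import coo_matrix
# M[i, buckets[i]] = 1, so multiplying by M sums the columns that share a bucket
M = coo_matrix((np.ones(n_features), (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)  # X_train must be a scipy.sparse matrix here (e.g. X_train_crs)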
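As a self-contained sketch of the sparse variant, the example below builds the bucket matrix, applies it to a CSR count matrix, and then derives a binary occurrence matrix from the result. The helper name bucket_matrix, the hand-picked bucket assignment, and the toy data are assumptions for illustration. This only mimics the effect of summing features into buckets; it does not reproduce the actual hash that HashingVectorizer applies to token strings, so the columns will not match its output.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

def bucket_matrix(buckets, n_buckets):
    """Build a sparse (n_features x n_buckets) 0/1 matrix (hypothetical helper)."""
    n_features = len(buckets)
    data = np.ones(n_features)
    rows = np.arange(n_features)
    return coo_matrix((data, (rows, buckets)),
                      shape=(n_features, n_buckets)).tocsr()

# toy sparse count matrix, standing in for the output of load_svmlight_file
X_counts = csr_matrix(np.array([[1., 1., 0., 2.],
                                [2., 3., 1., 0.],
                                [4., 0., 0., 0.]]))

# bucket assignment fixed by hand so the result is easy to check
buckets = np.array([0, 1, 1, 0])
M = bucket_matrix(buckets, n_buckets=2)

# sparse product: counts of features in the same bucket are added together
X_hashed = X_counts.dot(M)

# binary occurrences: 1 wherever a bucket has a nonzero count
X_binary = (X_hashed > 0).astype(X_hashed.dtype)

print(X_hashed.toarray())
print(X_binary.toarray())
```

Keeping M around and multiplying the test matrix by the same M gives consistent columns across train and test, just as with the dense loop.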