Why does my scikit-learn HashingVectorizer give me floats with binary=True set?

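I'm trying to use scikit-learn's Bernoulli Naive Bayes classifier. I had the classifier working fine on a small dataset using CountVectorizer, but ran into trouble when I tried to use HashingVectorizer to work with a larger dataset. Holding all other parameters constant (training documents, test documents, classifier and feature extractor settings) and just switching from CountVectorizer to HashingVectorizer caused my classifier to always spit out the same label for every document.

I wrote the following script to investigate the difference between the two feature extractors: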

from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

cv = CountVectorizer(binary=True, decode_error='ignore')
h = HashingVectorizer(binary=True, decode_error='ignore')

with open('moby_dick.txt') as fp:
    doc = fp.read()

# CountVectorizer has to learn a vocabulary; HashingVectorizer is stateless.
cv_result = cv.fit_transform([doc])
h_result = h.transform([doc])

print(cv_result)
print(repr(cv_result))
print(h_result)
print(repr(h_result))

(where 'moby_dick.txt' is the Project Gutenberg copy of Moby Dick)

(Condensed) results:

(0, 17319)    1
(0, 17320)    1
(0, 17321)    1
<1x17322 sparse matrix of type '<type 'numpy.int64'>'
with 17322 stored elements in Compressed Sparse Column format>
(0, 1048456)  0.00763203138591
(0, 1048503)  0.00763203138591
(0, 1048519)  0.00763203138591
<1x1048576 sparse matrix of type '<type 'numpy.float64'>'
with 17168 stored elements in Compressed Sparse Row format>

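As you can see, CountVectorizer in binary mode returns the integer 1 as the value of each feature (we only expect to see 1 since there is only one document). HashingVectorizer, on the other hand, returns floats (all the same here, but different documents produce a different value). I suspect my problem stems from passing these floats on to BernoulliNB.

Ideally, I'd like a way to get the same binary-format data from HashingVectorizer that I get from CountVectorizer. Failing that, I could use the BernoulliNB binarize parameter if I knew a sane threshold to set for this data, but I'm not clear on what these floats represent (they're obviously not token counts, since they're all the same and less than 1).

Any help would be appreciated.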

With the default settings, HashingVectorizer normalizes your feature vectors to unit Euclidean length:

>>> import numpy as np
>>> import scipy.linalg
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> text = "foo bar baz quux bla"
>>> X = HashingVectorizer(n_features=8).transform([text])
>>> X.toarray()
array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.57735027,
         0.        , -0.57735027,  0.        ]])
>>> scipy.linalg.norm(np.abs(X.toarray()))
1.0

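Setting binary=True only postpones this normalization until after the features have been binarized, i.e. after every non-zero feature has been set to 1. You also have to set norm=None to turn the normalization off: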

>>> X = HashingVectorizer(n_features=8, binary=True).transform([text])
>>> X.toarray()
array([[ 0.5,  0. ,  0. ,  0. ,  0.5,  0.5,  0.5,  0. ]])
>>> scipy.linalg.norm(X.toarray())
1.0
>>> X = HashingVectorizer(n_features=8, binary=True, norm=None).transform([text])
>>> X.toarray()
array([[ 1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.]])

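That's also the reason it returns float arrays: normalization requires them. While the vectorizer could be rigged to return another dtype, that would require a conversion inside the transform method, and probably another one back to float in the next estimator.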
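The numbers in both outputs can be checked by hand. A row holding k ones has Euclidean norm sqrt(k), so after L2 normalization each entry becomes 1/sqrt(k). The helper name `binary_l2_value` below is made up for this sketch and is not part of scikit-learn:

```python
import math


def binary_l2_value(n_nonzero):
    """Value of each non-zero entry after binarizing a row and L2-normalizing it."""
    # A row with k entries equal to 1 has Euclidean norm sqrt(k),
    # so dividing by the norm turns every entry into 1 / sqrt(k).
    return 1.0 / math.sqrt(n_nonzero)


# The 8-feature example has 4 occupied buckets.
print(binary_l2_value(4))      # 0.5
# The Moby Dick row has 17168 stored elements.
print(binary_l2_value(17168))  # about 0.007632
```

This matches the 0.5 entries in the 8-feature example and the 0.00763203138591 reported in the question. It also explains why the value changes from document to document: it depends only on how many distinct hash buckets the document occupies.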

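To replace CountVectorizer(binary=True) with HashingVectorizer, the correct parameters are norm=None (default "l2"), alternate_sign=False (default True) and binary=True (default False).

If you also need the output to have the same dtype as CountVectorizer, you can specify dtype="int64" (default "float64").

That said, when binary=True, dtype="uint8" is the most compact choice and will save you a lot of memory: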

>>> from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer
>>> 
>>> cv = CountVectorizer(binary=True)
>>> hv = HashingVectorizer(norm=None, alternate_sign=False, binary=True, dtype='uint8')
>>> 
>>> doc = "one two three two one"
>>> cv_result = cv.fit_transform([doc])
>>> hv_result = hv.transform([doc])
>>> 
>>> print(repr(cv_result))
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> print(cv_result)
(0, 0)    1
(0, 2)    1
(0, 1)    1
>>> print(f'used: {(cv_result.data.nbytes + cv_result.indptr.nbytes + cv_result.indices.nbytes)} bytes')
used: 44 bytes
>>> 
>>> print(repr(hv_result))
<1x1048576 sparse matrix of type '<class 'numpy.uint8'>'
with 3 stored elements in Compressed Sparse Row format>
>>> print(hv_result)
(0, 824960)   1
(0, 884299)   1
(0, 948532)   1
>>> print(f'used: {(hv_result.data.nbytes + hv_result.indptr.nbytes + hv_result.indices.nbytes)} bytes')
used: 23 bytes
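To tie this back to the original question, here is a minimal sketch of feeding the binary hashed features into BernoulliNB. The corpus and labels are invented for illustration, and the sketch assumes a scikit-learn version recent enough to support the alternate_sign parameter:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy corpus and labels, invented for this sketch.
train_docs = [
    "the whale swam away",
    "the ship sailed north",
    "a whale surfaced nearby",
    "a ship left port",
]
train_labels = ["whale", "ship", "whale", "ship"]

# Same settings as recommended above: no normalization, no sign flipping,
# presence/absence features stored as single bytes.
hv = HashingVectorizer(norm=None, alternate_sign=False, binary=True, dtype=np.uint8)
X_train = hv.transform(train_docs)  # stateless, so no fit step is needed

clf = BernoulliNB()
clf.fit(X_train, train_labels)

predictions = clf.predict(hv.transform(["the whale dived"]))
print(X_train.dtype, X_train.max(), predictions)
```

Because the vectorizer is stateless, the same `hv` object can transform training and test documents without ever seeing the full vocabulary, which is the reason for choosing it on larger datasets.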
