Strange transpose behavior in Sparkit-Learn

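I'm using Sparkit-Learn's SparkCountVectorizer and SparkTfidfVectorizer to turn a bunch of documents into a TFIDF matrix.

The TFIDF matrix is created and has the correct dimensions (496861 documents by 189398 distinct tokens):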

>>> tfidf
<class 'splearn.rdd.SparseRDD'> from PythonRDD[20] at RDD at PythonRDD.scala:48
>>> tfidf.shape
(496861, 189398)

Slicing out a single vector returns the correct output (1 document by 189398 distinct tokens):

>>> tfidf.flatMap(lambda x: x).take(1)
[<1x189398 sparse matrix of type '<class 'numpy.float64'>'
with 49 stored elements in Compressed Sparse Row format>]

Now I want the transpose of each document (i.e., a vector of dimensions 189398 by 1):

>>> tfidf.flatMap(lambda x: x.T).take(1)

But what I get instead is:

[<1x7764 sparse matrix of type '<class 'numpy.float64'>'
with 77 stored elements in Compressed Sparse Row format>]

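So instead of a 189398x1 vector, I get a 1x7764 one. I understand where 7764 comes from: when I read the data, I .repartition() it into 64 partitions, and 496861 (the number of documents) divided by 64 is about 7763.4. What I don't understand is why Sparkit-Learn iterates row by row in one case (lambda x: x) and by partition in the other (lambda x: x.T). I'm thoroughly confused.

In case it matters, my end goal is to filter the TFIDF matrix so that I only keep vectors with nonzero values in certain columns (i.e., only the documents containing certain words), and indexing the untransposed 1x189398 vector doesn't work (no matter how many [0]s I put after x, I always get the same 1x189398 vector back).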

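You are transposing the wrong thing. splearn.rdd.SparseRDD stores blocks of data, so x.T transposes a whole block rather than an individual vector. If a block has 7764 rows and 189398 columns, the transposed block has 189398 rows and 7764 columns, and when it is flattened it is iterated row by row, which yields 1x7764 rows.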
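
The effect can be reproduced without Spark. The following is a minimal sketch using only SciPy, assuming a block behaves like an ordinary scipy.sparse CSR matrix; the tiny 4x6 `block` and the `rows` helper are hypothetical stand-ins (explicit row slicing plays the role of the row iteration that flatMap performs), not part of Sparkit-Learn:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for one SparseRDD block: 4 documents x 6 tokens.
block = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 4.0, 0.0],
    [5.0, 0.0, 0.0, 0.0, 0.0, 6.0],
]))


def rows(matrix):
    """Yield each row of a sparse matrix as a 1 x n_cols sparse matrix."""
    for i in range(matrix.shape[0]):
        yield matrix[i, :]


# Analogous to flatMap(lambda x: x.T): the whole block is transposed first,
# so iteration walks the 6 rows of a 6 x 4 matrix.
block_transposed_shapes = [r.shape for r in rows(block.T)]

# Analogous to flatMap(lambda x: x).map(lambda x: x.T): rows are extracted
# first, then each 1 x 6 row is transposed on its own.
row_transposed_shapes = [r.T.shape for r in rows(block)]

print(block_transposed_shapes)
print(row_transposed_shapes)
```

In the first case every yielded item is as wide as the block is tall, which is the 1x7764 shape seen in the question; in the second case every item is a column vector with one entry per token.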

What you need is:

(tfidf
    # Iterate over each block and yield its rows:
    # block-size x 189398 -> 1 x 189398
    .flatMap(lambda x: x)
    # Take each row and transpose it:
    # 1 x 189398 -> 189398 x 1
    .map(lambda x: x.T))

or

(tfidf
    # Iterate over each row in the block (generator expression) and
    # transpose it: block-size x 189398 -> block-size vectors of 189398 x 1,
    # then flatten (with flatMap), yielding each 189398 x 1 vector
    .flatMap(lambda xs: (x.T for x in xs)))

Note: I'm not very familiar with Sparkit-Learn, so there may be a more elegant solution.
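
For the filtering goal mentioned in the question, transposing may not be needed at all. Below is a rough sketch in plain SciPy, assuming each block is a scipy.sparse matrix as described above; `rows_with_tokens` and the example column indices are hypothetical names chosen for illustration, and whether the function can be passed directly to a per-block operation on a SparseRDD is untested:

```python
import numpy as np
from scipy import sparse


def rows_with_tokens(block, token_columns):
    """Return the rows of `block` with a stored entry in any of `token_columns`."""
    block = sparse.csr_matrix(block)
    # Count stored entries per row, restricted to the columns of interest
    # (explicitly stored zeros, if any, would also be counted).
    hits = block[:, token_columns].getnnz(axis=1)
    keep = np.flatnonzero(hits > 0)
    return block[keep, :]


# Hypothetical block: 4 documents x 6 tokens.
block = sparse.csr_matrix(np.array([
    [1.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 4.0, 0.0],
    [5.0, 0.0, 0.0, 0.0, 0.0, 6.0],
]))

# Keep only the documents that contain token 0 or token 5.
filtered = rows_with_tokens(block, [0, 5])
print(filtered.shape)
```

Working on whole blocks keeps the selection vectorized, instead of indexing one 1x189398 row at a time.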
