Aggregating sparse vectors in PySpark

>I have a Hive table that contains text data and some metadata associated with each document. It looks like this.

from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import CountVectorizer
df = sc.parallelize([
  ("1", "doc_1", "fruit is good for you"),
  ("2", "doc_2", "you should eat fruit and veggies"),
  ("2", "doc_3", "kids eat fruit but not veggies")
]).toDF(["month","doc_id", "text"])

+-----+------+--------------------+
|month|doc_id|                text|
+-----+------+--------------------+
|    1| doc_1|fruit is good for...|
|    2| doc_2|you should eat fr...|
|    2| doc_3|kids eat fruit bu...|
+-----+------+--------------------+

I would like to end up with word counts by month. So far I have taken a CountVectorizer approach:

tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")
tokenized = tokenizer.transform(df)
cvModel = CountVectorizer().setInputCol("words").setOutputCol("features").fit(tokenized)
counted = cvModel.transform(tokenized)

+-----+------+--------------------+--------------------+--------------------+
|month|doc_id|                text|               words|            features|
+-----+------+--------------------+--------------------+--------------------+
|    1| doc_1|fruit is good for...|[fruit, is, good,...|(12,[0,3,4,7,8],[...|
|    2| doc_2|you should eat fr...|[you, should, eat...|(12,[0,1,2,3,9,11...|
|    2| doc_3|kids eat fruit bu...|[kids, eat, fruit...|(12,[0,1,2,5,6,10...|
+-----+------+--------------------+--------------------+--------------------+

Now I want to group by month and return something that looks like this:

month  word   count
1      fruit  1
1      is     1
...
2      fruit  2
2      kids   1
2      eat    2
...

How can I do that?

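There is no built-in mechanism for Vector* aggregation, but you don't need one here. Once you have the tokenized data, you can just explode the words column and aggregate: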

from pyspark.sql.functions import explode
(counted
    .select("month", explode("words").alias("word"))
    .groupBy("month", "word")
    .count())
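
To make it clearer what the explode and groupBy steps compute, here is a minimal plain-Python sketch of the same logic. It does not use Spark: the `rows` list simply repeats the sample data from the question, `count_words_by_month` is a hypothetical helper name, and `str.lower().split()` only approximates what Tokenizer does (lowercase, then split on whitespace).

```python
from collections import Counter

# Sample rows copied from the question: (month, doc_id, text).
rows = [
    ("1", "doc_1", "fruit is good for you"),
    ("2", "doc_2", "you should eat fruit and veggies"),
    ("2", "doc_3", "kids eat fruit but not veggies"),
]


def count_words_by_month(rows):
    """Plain-Python stand-in for explode("words") + groupBy("month", "word") + count().

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    counts = Counter()
    for month, _doc_id, text in rows:
        # Approximates Tokenizer: lowercase the text, then split on whitespace.
        for word in text.lower().split():
            # "explode" yields one (month, word) pair per token;
            # the Counter plays the role of groupBy(...).count().
            counts[(month, word)] += 1
    return counts


word_counts = count_words_by_month(rows)

# Print one line per (month, word) group, sorted for readability.
for (month, word), n in sorted(word_counts.items()):
    print(month, word, n)
```

Each token becomes its own (month, word) row, so counting rows per group gives the per-month word frequency directly, without touching the feature vectors.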

If you prefer to limit the results to the vocabulary, just add a filter:

from pyspark.sql.functions import col
(counted
    .select("month", explode("words").alias("word"))
    .where(col("word").isin(cvModel.vocabulary))
    .groupBy("month", "word")
    .count())

* Since Spark 2.4 we have access to Summarizer, but it won't be useful here.
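
For comparison, aggregating the count vectors themselves would amount to an element-wise sum per month followed by a lookup of each index in the vocabulary. The following is only a rough plain-Python sketch of that idea, not Spark code: the `vocabulary` list and the `{index: count}` dictionaries are made-up illustrative values (not the real output of the model in the question), and `sum_sparse_by_key` and `to_word_counts` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical vocabulary and sparse count vectors, written as
# (month, {index: count}). Illustrative values only.
vocabulary = ["fruit", "eat", "veggies", "kids"]
vectors = [
    ("1", {0: 1.0}),
    ("2", {0: 1.0, 1: 1.0, 2: 1.0}),
    ("2", {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}),
]


def sum_sparse_by_key(vectors):
    """Element-wise sum of sparse vectors that share the same key."""
    totals = defaultdict(lambda: defaultdict(float))
    for key, sparse in vectors:
        for index, value in sparse.items():
            totals[key][index] += value
    return totals


def to_word_counts(totals, vocabulary):
    """Map each summed index back to its word in the vocabulary."""
    return {
        (key, vocabulary[index]): int(value)
        for key, sparse in totals.items()
        for index, value in sparse.items()
    }


monthly = to_word_counts(sum_sparse_by_key(vectors), vocabulary)

# Print one line per (month, word) pair, sorted for readability.
for (month, word), n in sorted(monthly.items()):
    print(month, word, n)
```

This gives the same kind of (month, word, count) result as the explode approach, but needs the extra index-to-word mapping step, which is why working on the tokenized words is simpler here.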
