Distribution of text by word count in Spark Java

I'm a beginner, so apologies if this question is trivial. I'm trying to come up with an idiomatic Spark solution, but I can't find a way to do it.

My dataset looks like this:

+----------------------+
|input                 |
+----------------------+
|debt ceiling          |
|declaration of tax    |
|decryption            |
|sweats                |
|ladder                |
|definite integral     |
+----------------------+

I need to compute the distribution of rows by length in words, for example:

Option 1:

500 rows contain 1 or more words
120 rows contain 2 or more words
70 rows contain 3 or more words

Option 2:

300 rows contain 1 word
250 rows contain 2 words
220 rows contain 3 words
270 rows contain 4 or more words

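Is there a solution that uses the built-in Spark functions from Java? All I can think of is writing some kind of UDF with a broadcast counter, but I'm probably missing something, since there should be a better way to do this in Spark.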

Welcome to SO!

Here is a solution in Scala that you can easily adapt to Java.

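import org.apache.spark.sql.functions.{size, split}
import spark.implicits._  // needed for createDataset and the 'input column syntax

val df = spark.createDataset(Seq(
  "debt ceiling", "declaration of tax", "decryption", "sweats"
)).toDF("input")

// Split each row on whitespace, count the tokens, then count rows per word count.
df.select(size(split('input, "\\s+")).as("words"))
  .groupBy('words)
  .count
  .orderBy('words)
  .show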

This produces

+-----+-----+
|words|count|
+-----+-----+
|    1|    2|
|    2|    1|
|    3|    1|
+-----+-----+
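
The table above gives exact row counts per word count, while the question asks for two slightly different shapes: cumulative "n or more" counts (option 1) and a capped last bucket (option 2). The grouped result has only a handful of rows, so one possible approach is to collect it to the driver (for example with collectAsList()) and post-process it in plain Java. Below is a minimal sketch under that assumption. The class WordCountBuckets and its methods atLeast and capped are hypothetical names introduced for illustration, and the histogram is assumed to already be in a Map<Integer, Long>.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: post-processes a small (wordCount -> rowCount) histogram on the driver.
final class WordCountBuckets {

    // Option 1: for each word count n present in the histogram, count rows with n or more words.
    static Map<Integer, Long> atLeast(Map<Integer, Long> histogram) {
        TreeMap<Integer, Long> sorted = new TreeMap<>(histogram);
        TreeMap<Integer, Long> result = new TreeMap<>();
        long running = 0;
        // Walk from the largest word count down, accumulating a running total.
        for (Map.Entry<Integer, Long> e : sorted.descendingMap().entrySet()) {
            running += e.getValue();
            result.put(e.getKey(), running);
        }
        return result;
    }

    // Option 2: exact counts below `cap`; everything at or above `cap` is merged into one bucket.
    static Map<Integer, Long> capped(Map<Integer, Long> histogram, int cap) {
        TreeMap<Integer, Long> result = new TreeMap<>();
        for (Map.Entry<Integer, Long> e : histogram.entrySet()) {
            result.merge(Math.min(e.getKey(), cap), e.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // The histogram from the table above: 2 rows with 1 word, 1 with 2, 1 with 3.
        Map<Integer, Long> histogram = new TreeMap<>();
        histogram.put(1, 2L);
        histogram.put(2, 1L);
        histogram.put(3, 1L);

        System.out.println(atLeast(histogram));   // {1=4, 2=2, 3=1}
        System.out.println(capped(histogram, 2)); // {1=2, 2=2}
    }
}
```

For option 2 it should also be possible to stay entirely inside Spark by wrapping the word count in least(..., lit(4)) from org.apache.spark.sql.functions before grouping, which merges every row with 4 or more words into one bucket without collecting anything.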
