Spark (scala) - 迭代 DF 列并计算一组项目中的匹配项数 - Spark (scala) - Iterate over DF column and count number of matches from a set of items 小贝子编程网

所以我现在可以迭代数据帧中的一列字符串，并检查是否有任何字符串包含大型字典中的任何项目(参见这里，感谢@raphael-roth和@tzach-zohar(。为此，基本的 udf(不包括广播dict列表(是：

val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()

我尝试做的下一件事也是以最有效的方式计算dict集中发生的匹配数量(我正在处理非常大的数据集和dict文件(。

我一直在尝试在udf中使用findAllMatchIn，同时使用count和map：

val checkerUdf = udf { (s: String) => dict.count(_.r.findAllMatchIn(s))
// OR
val checkerUdf = udf { (s: String) => dict.map(_.r.findAllMatchIn(s))

但这返回了一个迭代器列表(空和非空(，我得到一个类型不匹配(找到迭代器，需要布尔值(。我不确定如何计算非空迭代器(count和size和length不起作用(。

知道我做错了什么吗？有没有更好/更有效的方法来实现我想要做的事情？

您可以将其他问题的答案更改为

import org.apache.spark.sql.functions._
val checkerUdf = udf { (s: String) => dict.count(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()

鉴于dataframe为

+---+---------+
|id |words    |
+---+---------+
|1  |foo      |
|2  |barriofoo|
|3  |gitten   |
|4  |baa      |
+---+---------+

和字典文件作为

val dict = Set("foo","bar","baaad")

您应该将输出为

+---+---------+----------+
| id|    words|word_check|
+---+---------+----------+
|  1|      foo|         1|
|  2|barriofoo|         2|
|  3|   gitten|         0|
|  4|      baa|         0|
+---+---------+----------+

我希望答案对您有所帮助

Spark (scala) - 迭代 DF 列并计算一组项目中的匹配项数

相关内容

最新更新

热门标签：