PySpark DataFrame: removing duplicates in an array column

I want to remove duplicate words from a column of a PySpark DataFrame.

This is based on the question "Remove duplicates from PySpark array column".

My Spark version:

2.4.5

Python 3 code:

from pyspark.sql import functions as F

test_df = spark.createDataFrame([("I like this Book and this book be DOWNLOADED on line",)], ["text"])
# Convert to an array because the column in the original large DataFrame is of array type.
t3 = test_df.withColumn("text", F.array("text"))
t4 = t3.withColumn('text', F.expr("transform(text, x -> lower(x))"))
t5 = t4.withColumn('text', F.array_distinct("text"))
t5.show(1, 120)

But I got

+--------------------------------------------------------+
|                                                    text| 
+--------------------------------------------------------+
|[i like this book and this book be downloaded on line]|
+--------------------------------------------------------+

I need to remove

book and this

It seems that array_distinct cannot filter them out?

Thanks

You can use the Spark SQL functions lcase, split, array_distinct and array_join, called through F.expr from pyspark.sql.functions.

For example, F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')")

Here is the working code

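import pyspark.sql.functions as F

# df is assumed to have a string column named "text".
# The parentheses let the method chain span several lines.
(
    df
    .withColumn("text_new",
                F.expr("array_join(array_distinct(split(lcase(text),' ')),' ')"))
    .show(truncate=False)
)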

Explanation:

Here, you first convert everything to lowercase with lcase(text), then split the string on spaces into an array with split(text,' '), which produces

[i, like, this, book, and, this, book, be, downloaded, on, line]

Then you pass it to array_distinct, which produces

[i, like, this, book, and, be, downloaded, on, line]

Finally, use array_join to join the words back together with spaces

i like this book and be downloaded on line
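The three steps can also be mirrored in plain Python to check the expected result on a small sample. This is only an illustrative sketch, not Spark code: dedupe_words is a hypothetical helper name, and dict.fromkeys is used here to keep the first occurrence of each word in order.

```python
def dedupe_words(sentence: str) -> str:
    """Lowercase a sentence and drop repeated words, keeping first occurrences."""
    # Step 1 and 2: lowercase, then split on single spaces (like split(lcase(text), ' ')).
    words = sentence.lower().split(" ")
    # Step 3: remove duplicates while preserving order (like array_distinct).
    unique_words = list(dict.fromkeys(words))
    # Step 4: join the remaining words with spaces (like array_join).
    return " ".join(unique_words)


result = dedupe_words("I like this Book and this book be DOWNLOADED on line")
print(result)
```

Note that this removes every repeated word, wherever it occurs in the sentence, not only adjacent repeats.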
