Suppose you have a Spark DataFrame with multiple columns and you want to return the rows in which any column contains certain characters. Specifically, you want to return the rows where at least one field contains ( ), [ ], % or +. If you want to use the Spark SQL rlike function, what is the appropriate syntax?
import spark.implicits._
val dummyDf = Seq(("John[", "Ha", "Smith?"),
("Julie", "Hu", "Burol"),
("Ka%rl", "G", "Hu!"),
("(Harold)", "Ju", "Di+")
).toDF("FirstName", "MiddleName", "LastName")
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
Expected Output
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
+---------+----------+--------+
My attempts so far have returned errors, even when I only tried a simple search.
I know I could repeat a simple per-column check several times, but I am trying to do this more concisely with a regex and Spark SQL.
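Something like the following works (a rough sketch of the repeated per-column check; the escaped character class is my own guess at the final pattern), but it does not scale once there are many columns:

import org.apache.spark.sql.functions.col

// Repeat the same rlike test for every column and OR the results together.
val pattern = "[()\\[\\]%+]"
dummyDf.filter(
  col("FirstName").rlike(pattern) ||
    col("MiddleName").rlike(pattern) ||
    col("LastName").rlike(pattern)
).show()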
You can try the rlike method:
dummyDf.show()
+---------+----------+--------+
|FirstName|MiddleName|LastName|
+---------+----------+--------+
| John[| Ha| Smith?|
| Julie| Hu| Burol|
| Ka%rl| G| Hu!|
| (Harold)| Ju| Di+|
| +Tim| Dgfg| Ergf+|
+---------+----------+--------+
import org.apache.spark.sql.functions.{col, lit}

// Add a helper column that starts out false, then OR in an rlike test per column.
val df = dummyDf.withColumn("hasSpecial", lit(false))
val result = df.dtypes
  .collect { case (dn, dt) => dn }
  .foldLeft(df)((accDF, c) =>
    accDF.withColumn("hasSpecial", col(c).rlike(".*[()\\[\\]%+]+.*") || col("hasSpecial")))
result.filter(col("hasSpecial")).show(false)
Output:
+---------+----------+--------+----------+
|FirstName|MiddleName|LastName|hasSpecial|
+---------+----------+--------+----------+
|John[ |Ha |Smith? |true |
|Ka%rl |G |Hu! |true |
|(Harold) |Ju |Di+ |true |
|+Tim |Dgfg |Ergf+ |true |
+---------+----------+--------+----------+
You can also drop the hasSpecial column afterwards if you no longer need it.
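As a side note, if you do not need the helper column at all, the same idea can be written as a single filter by reducing one rlike condition per column; this is just a sketch along the same lines, not a different approach:

import org.apache.spark.sql.functions.col

// Build one boolean condition over all columns and filter directly.
val condition = dummyDf.columns
  .map(c => col(c).rlike("[()\\[\\]%+]"))
  .reduce(_ || _)
dummyDf.filter(condition).show(false)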
Try this: .*[()\[\]%+,.]+.*
.* matches any character, zero or more times
[()\[\]%+,.] matches any of the characters listed inside the brackets, one or more times
.* matches any character, zero or more times
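If you want to sanity-check the pattern outside of Spark first, a plain Scala snippet like this (with the brackets escaped so they are treated literally inside the character class) shows which of the sample values match:

// Quick check of the regex against the sample values from the question.
val pattern = ".*[()\\[\\]%+,.]+.*"
Seq("John[", "Julie", "Ka%rl", "(Harold)", "Di+").foreach { s =>
  println(s"$s matches: ${s.matches(pattern)}")
}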