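Suppose I have a dataframe d1 with a column "color" containing a bunch of colors, and another dataframe d2 with a column "phrases" containing various phrases.
I want to join the two dataframes where the color in d1 appears in the phrases in d2. I can't use d1.join(d2, d2("phrases").contains(d1("color"))),
because it would join wherever the word occurs in the phrase. I don't want to match words like scaRED, for example, where RED is part of another word. I only want to join when the color appears as a separate word in the phrase.
Can I use a regular expression to solve this? What function can I use when I need to reference a column inside the expression, and what is the syntax?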
You can create a regex pattern that checks for word boundaries (\b) when matching the colors, and use a regexp_replace check as the join condition:
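import org.apache.spark.sql.functions._
// In spark-shell, spark.implicits._ (needed for toDF) is already in scope
val df1 = Seq(
  (1, "red"), (2, "green"), (3, "blue")
).toDF("id", "color")
val df2 = Seq(
  "red apple", "scared cat", "blue sky", "green hornet"
).toDF("phrase")
// "\\b" is the regex word boundary; a single "\b" in a Scala string is a backspace character
val patternCol = concat(lit("\\b"), df1("color"), lit("\\b"))
// If removing the pattern changes the phrase, the color occurs in it as a whole word
df1.join(df2, regexp_replace(df2("phrase"), patternCol, lit("")) =!= df2("phrase")).show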
// +---+-----+------------+
// | id|color|      phrase|
// +---+-----+------------+
// |  1|  red|   red apple|
// |  3| blue|    blue sky|
// |  2|green|green hornet|
// +---+-----+------------+
Note that "scared cat" would have been a match without the enclosing word boundaries.
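To see the effect of the word boundaries in isolation, here is a minimal plain-Scala sketch that does not need Spark. The object name WordBoundaryDemo and the helper containsWord are hypothetical, made up for this illustration; Pattern.quote is added on the assumption that a color value might contain regex metacharacters.

```scala
import java.util.regex.Pattern

object WordBoundaryDemo {
  // Hypothetical helper: true when `word` occurs in `phrase` as a whole word.
  // Pattern.quote escapes any regex metacharacters that `word` may contain.
  def containsWord(phrase: String, word: String): Boolean =
    Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(phrase).find()

  def main(args: Array[String]): Unit = {
    val phrases = Seq("red apple", "scared cat", "blue sky", "green hornet")
    // Print, for each phrase, whether "red" appears in it as a separate word
    phrases.foreach(p => println(s"$p -> ${containsWord(p, "red")}"))
  }
}
```

With this helper, "red apple" matches "red" while "scared cat" does not, which is the same distinction the join condition relies on.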
While building your own solution, you could also try the following:
d1.join(d2, array_contains(split(d2("phrases"), " "), d1("color")))
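I haven't seen your data, but this is a start, with a few variations. As far as I can see there is no need for a regex, but who knows:
// You need to do some parsing, such as stripping out . and ?, and possibly normalizing to lower or upper case
// You did not provide an example of the JOIN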
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
val checkValue = udf { (array: WrappedArray[String], value: String) => array.iterator.map(_.toLowerCase).contains(value.toLowerCase() ) }
//Gen some data
val dfCompare = spark.sparkContext.parallelize(Seq("red", "blue", "gold", "cherry")).toDF("color")
val rdd = sc.parallelize( Array( (("red","hello how are you red",10)), (("blue", "I am fine but blue",20)), (("cherry", "you need to do some parsing and I like cherry",30)), (("thebluephantom", "you need to do some parsing and I like fanta",30)) ))
//rdd.collect
val df = rdd.toDF()
val df2 = df.withColumn("_4", split($"_2", " "))
df2.show(false)
dfCompare.show(false)
val res = df2.join(dfCompare, checkValue(df2("_4"), dfCompare("color")), "inner")
res.show(false)
Returns:
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|_1    |_2                                           |_3 |_4                                                      |color |
+------+---------------------------------------------+---+--------------------------------------------------------+------+
|red   |hello how are you red                        |10 |[hello, how, are, you, red]                             |red   |
|blue  |I am fine but blue                           |20 |[I, am, fine, but, blue]                                |blue  |
|cherry|you need to do some parsing and I like cherry|30 |[you, need, to, do, some, parsing, and, I, like, cherry]|cherry|
+------+---------------------------------------------+---+--------------------------------------------------------+------+
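The code comments above mention stripping punctuation and normalizing case before comparing. As a rough sketch of what that parsing could look like in plain Scala (no Spark required), the hypothetical helper tokens below lower-cases a phrase and splits it on runs of characters that are neither letters nor digits. The object name, the helper name, and the choice of delimiter are all assumptions for illustration, not part of the answer's code.

```scala
object TokenizeDemo {
  // Hypothetical helper: lower-case the phrase, split on runs of characters
  // that are neither letters nor digits, and drop any empty tokens.
  def tokens(phrase: String): Seq[String] =
    phrase.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Punctuation and capitalization no longer hide the whole word "red"
    println(tokens("Hello, how are you? Red.").contains("red"))
    // A color embedded in a longer word is still not a token of its own
    println(tokens("thebluephantom likes fanta").contains("blue"))
  }
}
```

Logic like this could be applied inside the UDF, or before the split step, so that a phrase such as "I like red." still yields the token "red".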