Spark 数据帧无法比较空值



嗨,我有 2 个数据帧,我正在比较数据帧的值,并基于将值分配给一个新数据帧的值。 所有场景都运行良好,期望空字段比较,即如果在两个数据帧值中都是空的,那么它应该显示为"varified",但它给我的是"未varified"我正在共享我正在使用的数据帧数据和代码以及下面最终数据帧的结果。

scala> df1.show()
+---+-----+---+--------+------+-------+
| id| name|age|lastname|  city|country|
+---+-----+---+--------+------+-------+
|  1|rohan| 26|  sharma|mumbai|  india|
|  2|rohan| 26|  sharma|  null|  india|
|  3|rohan| 26|    null|mumbai|  india|
|  4|rohan| 26|  sharma|mumbai|  india|
+---+-----+---+--------+------+-------+
scala> df2.show()
+----+------+-----+----------+------+---------+
|o_id|o_name|o_age|o_lastname|o_city|o_country|
+----+------+-----+----------+------+---------+
|   1| rohan|   26|    sharma|mumbai|    india|
|   2| rohan|   26|    sharma|  null|    india|
|   3| rohan|   26|    sharma|mumbai|    india|
|   4| rohan|   26|      null|mumbai|    india|
+----+------+-----+----------+------+---------+
val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") === df2("o_name") && 
df1("age") === df2("o_age") && 
df1("lastname") === df2("o_lastname") && 
df1("city") === df2("o_city")  &&
df1("country") === df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

我希望对于 id '2',它也应该显示为"Varified".但是该城市在两列中都是空的,然后显示为"未认证"。 有人可以指导我应该如何修改我的 df3 查询,以便它也可以检查 null,对于 id "2"也可以在结果列中显示为"已验证"。

使用<=>而不是===

val df3 = df1.join(df2, df1("id") === df2("o_id"))
.withColumn("result", when(df1("name") <=> df2("o_name") && 
df1("age") <=> df2("o_age") && 
df1("lastname") <=> df2("o_lastname") && 
df1("city") <=> df2("o_city")  &&
df1("country") <=> df2("o_country"), "Varified")
.otherwise("Not Varified")).show()
spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

结果

+-----------------+
|(city1 <=> city2)|
+-----------------+
|            true |
+-----------------+

when+otherwise语句中添加<=>(或(||运算符,并检查.isNull以获取last_name and city列。

null=null返回null我们无法匹配的原因。

spark.sql("select null=null").show()
//+-------------+
//|(NULL = NULL)|
//+-------------+
//|         null|
//+-------------+

Using <=>,isnull():

spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
//+---------------+---------------------------------+
//|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
//+---------------+---------------------------------+
//|           true|                             true|
//+---------------+---------------------------------+

Example:

df1.join(df2, df1("id") === df2("o_id")).
withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age") ) && 
(df1("lastname") === df2("o_lastname")|| (df1("lastname").isNull === df2("o_lastname").isNull)) && 
(df1("city") === df2("o_city")|| (df1("city").isNull === df2("o_city").isNull))  && 
(df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
show()
//or using <>
df1.join(df2, df1("id") === df2("o_id")).withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) && (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city"))  && (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).show()
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
//|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|    Varified|
//|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
//|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

最新更新