Left anti join does not treat null as duplicate values in Spark



I have two tables, and I want to read only the unique records from the source table. Both tables contain null values.

source table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null
ccc  | 30  | mech   | null

target table:

name | age | degree | dept
aaa  | 20  | ece    | null
bbb  | 20  | it     | null

source_df.join(target_df, Seq("name","age","degree"), "leftanti") -> works

source_df.join(target_df, Seq("name","age","degree","dept"), "leftanti") -> does not work

Now I need to pick only the 3rd record from the source. If I use name, age, and degree as my join keys, it works as expected, but when I include dept it picks all the records from the source table. Please help me.
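
For context, the setup can be reproduced with a minimal sketch like the following (assuming a SparkSession named `spark` is in scope, e.g. in spark-shell; the `Option.empty` values stand in for the nulls shown above):

// Minimal reproduction of the question's setup.
import spark.implicits._

val source_df = Seq(
  ("aaa", 20, "ece",  Option.empty[String]),
  ("bbb", 20, "it",   Option.empty[String]),
  ("ccc", 30, "mech", Option.empty[String])
).toDF("name", "age", "degree", "dept")

val target_df = Seq(
  ("aaa", 20, "ece", Option.empty[String]),
  ("bbb", 20, "it",  Option.empty[String])
).toDF("name", "age", "degree", "dept")

// Including "dept" in the key keeps every source row: an anti join drops a
// row only when the join condition is TRUE, and null = null evaluates to
// null (not true), so no row ever matches.
source_df.join(target_df, Seq("name", "age", "degree", "dept"), "leftanti").show(false)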
Perform a null-safe equality test:

source_df.join(target_df,
  source_df("name") <=> target_df("name") &&
  source_df("age") <=> target_df("age") &&
  source_df("degree") <=> target_df("degree") &&
  source_df("dept") <=> target_df("dept"),
  "leftanti").show(false)
/**
* +----+---+------+----+
* |name|age|degree|dept|
* +----+---+------+----+
* |ccc |30 |mech  |null|
* +----+---+------+----+
*/
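
If there are many key columns, the null-safe condition can also be built programmatically rather than writing out each `<=>` term by hand. A sketch under the same assumptions as above (the `keys` list is illustrative):

// Build the null-safe join condition from a list of key column names.
val keys = Seq("name", "age", "degree", "dept")
val condition = keys
  .map(c => source_df(c) <=> target_df(c))
  .reduce(_ && _)

source_df.join(target_df, condition, "leftanti").show(false)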

In Python, replace `<=>` with the method call `eqNullSafe`, as in the following example:

df1.join(df2, df1["value"].eqNullSafe(df2["value"]))

Spark provides a null-safe equality operator to handle this situation. I ran into a similar case where duplicate records were being inserted because one column was null. `null == null` returns null, while `null <=> null` returns true. See the documentation: https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html
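
The difference is easy to verify directly in Spark SQL (a quick check, assuming the same `spark` session as above):

spark.sql("SELECT null = null AS eq, null <=> null AS null_safe_eq").show(false)
/**
 * +----+------------+
 * |eq  |null_safe_eq|
 * +----+------------+
 * |null|true        |
 * +----+------------+
 */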
