Difference between === null and isNull in Spark DataFrame



I am a bit confused about the difference when we use

 df.filter(col("c1") === null) and df.filter(col("c1").isNull) 

On the same DataFrame I get counts with === null but zero counts with isNull. Please help me understand the difference. Thanks.

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

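Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks an undefined value, the result of the comparison is undefined for any value, including NULL itself. A filter keeps only rows whose predicate evaluates to true, so an undefined result never matches.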

spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
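The four queries above can be mirrored in plain Scala. This is only a minimal sketch of the semantics, not Spark's implementation: it assumes SQL NULL is modelled as None, and the names SqlEquality, sqlEq and sqlNotEq are made up for illustration.

```scala
object SqlEquality {
  // None models SQL NULL; Some(v) models a known value.
  // SQL `=`: the result is unknown (None) as soon as either side is NULL.
  def sqlEq[A](a: Option[A], b: Option[A]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // SQL `!=` is NOT (a = b); negating an unknown result is still unknown.
  def sqlNotEq[A](a: Option[A], b: Option[A]): Option[Boolean] =
    sqlEq(a, b).map(!_)

  def main(args: Array[String]): Unit = {
    println(sqlEq(None, None))             // NULL = NULL   -> None
    println(sqlNotEq(None, None))          // NULL != NULL  -> None
    println(sqlNotEq(Some(true), None))    // TRUE != NULL  -> None
    println(sqlEq(Some(true), None))       // TRUE = NULL   -> None
    println(sqlEq(Some(true), Some(true))) // TRUE = TRUE   -> Some(true)
  }
}
```

Only a comparison between two known values produces Some(true) or Some(false); everything involving None stays None, which is why c1 = NULL cannot select any row.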

The only valid methods to check for NULL are:

  • IS NULL

    spark.sql("SELECT NULL IS NULL").show
    
    +--------------+
    |(NULL IS NULL)|
    +--------------+
    |          true|
    +--------------+
    
    spark.sql("SELECT TRUE IS NULL").show
    
    +--------------+
    |(true IS NULL)|
    +--------------+
    |         false|
    +--------------+
    
  • IS NOT NULL

    spark.sql("SELECT NULL IS NOT NULL").show
    
    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    
    spark.sql("SELECT TRUE IS NOT NULL").show
    
    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

In the DataFrame DSL these are implemented as Column.isNull and Column.isNotNull, respectively.

For NULL-safe comparisons, use IS DISTINCT FROM / IS NOT DISTINCT FROM:

spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
|                           false|
+--------------------------------+

or not(_ <=> _) / <=>

spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|           true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
|          false|
+---------------+

in SQL and the DataFrame DSL, respectively.
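Applied to the filter from the question, the checks described above could be compared side by side as follows. This is a sketch under assumptions: it expects spark-sql on the classpath and a local SparkSession, and the object name NullChecks, the helper nullCheckCounts and the two-row sample data are illustrative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, not}

object NullChecks {
  // Counts the rows matched by each style of NULL check on column c1.
  // A Seq of pairs is used so the results keep their order.
  def nullCheckCounts(df: DataFrame): Seq[(String, Long)] = Seq(
    "=== null"      -> df.filter(col("c1") === null).count(),            // c1 = NULL is never true
    "isNull"        -> df.filter(col("c1").isNull).count(),              // c1 IS NULL
    "isNotNull"     -> df.filter(col("c1").isNotNull).count(),           // c1 IS NOT NULL
    "<=> null"      -> df.filter(col("c1") <=> lit(null)).count(),       // NULL-safe equality
    "not(<=> null)" -> df.filter(not(col("c1") <=> lit(null))).count()   // NULL-safe inequality
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-checks").getOrCreate()
    import spark.implicits._

    // Assumed sample data: one row with c1 = "a", one row with c1 = NULL.
    val df = Seq(("a", "b"), (null, "c")).toDF("c1", "c2")

    // With this data: "=== null" matches 0 rows, every other check matches 1 row.
    nullCheckCounts(df).foreach { case (check, n) => println(s"$check -> $n") }

    spark.stop()
  }
}
```

The NULL-safe operator <=> behaves like isNull when compared against a NULL literal, but it is mainly useful when both sides are columns that may contain NULL, for example in join conditions.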

Related:

Including null values in an Apache Spark Join

Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
object Example extends App {
  val session = SparkSession.builder().master("local[*]").getOrCreate()
  case class Record(c1: String, c2: String)
  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)
  import session.implicits._
  val df: DataFrame = rdd.toDF
  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not expected
  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <- outputs 1, as expected
  println(filtered2) // prints the Dataset's string form (its schema), not a count
  filtered.explain(true)
  filtered2.explain(true)
}

The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]
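
This filter clause looks nonsensical. The && with null ensures that it never resolves to true, so no row can pass the filter.

The second explain plan looks like this:

== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]
== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]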

Here the filter is exactly what is expected and wanted.
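Why the predicate isnotnull(c1#2) && null from the first physical plan can never be true follows from SQL's three-valued AND. The plain-Scala sketch below models that logic with None standing in for NULL; it is an illustration of the semantics rather than Spark's code, and the names KleeneAnd, and3, isNotNull and firstPlanPredicate are invented for this example.

```scala
object KleeneAnd {
  // SQL AND under three-valued logic; None models NULL (unknown).
  def and3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // false wins over unknown
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None        // anything else is unknown
  }

  // isnotnull always yields a definite true or false, never NULL.
  def isNotNull[A](v: Option[A]): Option[Boolean] = Some(v.isDefined)

  // Hypothetical model of the first plan's predicate: isnotnull(c1) && null
  def firstPlanPredicate(c1: Option[String]): Option[Boolean] =
    and3(isNotNull(c1), None)

  def main(args: Array[String]): Unit = {
    val rows: List[Option[String]] = List(Some("a"), None)

    // A filter keeps a row only when the predicate is definitely true.
    val kept = rows.filter(c1 => firstPlanPredicate(c1).contains(true))

    println(firstPlanPredicate(Some("a"))) // None: true AND unknown is unknown
    println(firstPlanPredicate(None))      // Some(false): false AND unknown is false
    println(kept.size)                     // 0
  }
}
```

For a non-null c1 the predicate is unknown and for a null c1 it is false, so in both cases the row is dropped, which matches the count of 0 in the example.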
