Spark allows filtering/selecting on non-existent columns

I'm seeing very strange, or at least unexpected, behavior with the Spark SQL library, version 2.2.1. I have a simple DataFrame:

val df = spark.read.parquet("...")
df: org.apache.spark.sql.DataFrame = [FIELD_A: string, FIELD_B: string ... 244 more fields]
val tmp = df.select("FIELD_A")
tmp: org.apache.spark.sql.DataFrame = [FIELD_A: string]

If I do the following, I get unexpected behavior:

// the following line does not work (as expected)
tmp.select("FIELD_B").count()
// the following line WORKS (unexpectedly)
tmp.where($"FIELD_B" === "abc").count()

Why can I filter on a non-selected column that should no longer exist in the DataFrame tmp? I couldn't find any explanation for this on the internet.

Thanks!

It looks like Spark rearranges the select (Project) and where (Filter) stages while resolving the plan, without first checking whether the original plan was actually valid as written. As the analyzed logical plan below shows, the missing column FIELD_B is re-resolved from the underlying relation and the filter is applied beneath the final projection:

scala> case class Foo(FIELD_A: String, FIELD_B: String)
defined class Foo
scala> val tmp = List(Foo("q", "w"), Foo("e", "r")).toDF
tmp: org.apache.spark.sql.DataFrame = [FIELD_A: string, FIELD_B: string]
scala> tmp.select("FIELD_A").where($"FIELD_B" === "r").explain(true)
== Parsed Logical Plan ==
'Filter ('FIELD_B = r)
+- Project [FIELD_A#2]
+- LocalRelation [FIELD_A#2, FIELD_B#3]
== Analyzed Logical Plan ==
FIELD_A: string
Project [FIELD_A#2]
+- Filter (FIELD_B#3 = r)
+- Project [FIELD_A#2, FIELD_B#3]
+- LocalRelation [FIELD_A#2, FIELD_B#3]
== Optimized Logical Plan ==
Project [FIELD_A#2]
+- Filter (isnotnull(FIELD_B#3) && (FIELD_B#3 = r))
+- LocalRelation [FIELD_A#2, FIELD_B#3]
== Physical Plan ==
*Project [FIELD_A#2]
+- *Filter (isnotnull(FIELD_B#3) && (FIELD_B#3 = r))
+- LocalTableScan [FIELD_A#2, FIELD_B#3]
scala> tmp.select("FIELD_A").where($"FIELD_B" === "r").collect()
res5: Array[org.apache.spark.sql.Row] = Array([e])
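This also suggests a way to get the error one would expect, assuming you want select() to truly truncate the column set: rebuilding the DataFrame from its RDD and schema severs the logical-plan lineage, so a later filter on the dropped column can no longer be resolved and fails with an AnalysisException. A minimal sketch, meant to be run in spark-shell (where the `spark` session and its implicits are predefined); `createDataFrame` is the standard SparkSession API, and the behavior is as observed on Spark 2.x:

```scala
// Run in spark-shell; `spark` and its implicits are predefined there.
import org.apache.spark.sql.AnalysisException
import spark.implicits._

val df = List(("q", "w"), ("e", "r")).toDF("FIELD_A", "FIELD_B")
val tmp = df.select("FIELD_A")

// Through the normal lineage, filtering on the dropped column still works,
// because the analyzer resolves FIELD_B against tmp's child plan:
val leakyCount = tmp.where($"FIELD_B" === "r").count()

// Rebuilding the DataFrame from its RDD and schema severs the lineage,
// so the same filter now fails eagerly during analysis:
val detached = spark.createDataFrame(tmp.rdd, tmp.schema)
val failedAsExpected =
  try { detached.where($"FIELD_B" === "r"); false }
  catch { case _: AnalysisException => true }
```

Here `leakyCount` is 1 (the row with FIELD_B = "r"), while the detached copy rejects the filter, because Dataset transformations are analyzed eagerly and `detached`'s plan no longer carries FIELD_B.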
