I'm trying out the Scala API for Spark and want to join several tables together, then fill the null values with zeros.
val left = Seq(("bob", 6), ("alice", 10), ("charlie", 4)).toDF("name", "count")
val right = Seq(("alice", 100),("bob", 23)).toDF("name","count")
val df = left.join(right, Seq("name"), "left_outer")
df.na.fill(0)
df.orderBy(left("count")).show(3)
However, I get:
org.apache.spark.sql.AnalysisException: Reference 'count' is ambiguous, could be: count#6619, count#6629.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:218)
at org.apache.spark.sql.Dataset.col(Dataset.scala:921)
at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:411)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:162)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:159)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:159)
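I have tried many different variants of fill(...), such as fill(0, Seq("right.count")) and fill(0, Seq("count")), but they all fail in the same way. Commenting out the fill(...) line makes everything work, but then there are nulls where I want zeros.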
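Remove the duplicate column name before filling. After the join, both tables contribute a column called count, and na.fill looks columns up by name, so the reference cannot be resolved. Renaming one of them avoids the ambiguity: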
scala> val df = left.join(right.withColumnRenamed("count", "count2"), Seq("name"), "left_outer")
.na.fill(0)
df: org.apache.spark.sql.DataFrame = [name: string, count: int ... 1 more field]
scala> df.show
+-------+-----+------+
|   name|count|count2|
+-------+-----+------+
|    bob|    6|    23|
|  alice|   10|   100|
|charlie|    4|     0|
+-------+-----+------+
scala> df.orderBy(left("count")).show(3)
+-------+-----+------+
|   name|count|count2|
+-------+-----+------+
|charlie|    4|     0|
|    bob|    6|    23|
|  alice|   10|   100|
+-------+-----+------+
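For use outside the REPL, the same approach might be sketched as a small standalone program. This is only an illustration under some assumptions: the object name JoinFillSketch, the helper joinAndFill, and the local[*] master are made up for the example, and the fill is restricted to count2 so that no other column is touched.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object; the name is not from the original post.
object JoinFillSketch {

  // Rename the right-hand "count" before joining so the joined frame has
  // unique column names, then replace nulls in that column with 0.
  def joinAndFill(left: DataFrame, right: DataFrame): DataFrame =
    left
      .join(right.withColumnRenamed("count", "count2"), Seq("name"), "left_outer")
      .na.fill(0, Seq("count2"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy data.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-fill-sketch")
      .getOrCreate()
    import spark.implicits._

    val left  = Seq(("bob", 6), ("alice", 10), ("charlie", 4)).toDF("name", "count")
    val right = Seq(("alice", 100), ("bob", 23)).toDF("name", "count")

    // "count" is unambiguous now, so it can be referenced by name.
    joinAndFill(left, right).orderBy("count").show()

    spark.stop()
  }
}
```

Note that the result of na.fill has to be used or assigned: DataFrames are immutable, so calling df.na.fill(0) on its own line and then continuing with df leaves the nulls in place.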