How to replace NULL with 0 in a left outer join result set in Spark DataFrame v1.6



I am working with Spark v1.6. I have the following two DataFrames, and I want to convert null to 0 in the left outer join result set. Any suggestions?

DataFrames

val x: Array[Int] = Array(1,2,3)
val df_sample_x = sc.parallelize(x).toDF("x")
val y: Array[Int] = Array(3,4,5)
val df_sample_y = sc.parallelize(y).toDF("y")

Left outer join

val df_sample_join = df_sample_x
  .join(df_sample_y,df_sample_x("x") === df_sample_y("y"),"left_outer")

Result set

scala> df_sample_join.show
x  |  y
--------
1  |  null
2  |  null
3  |  3

But I want the result set to be displayed as:

scala> df_sample_join.show
x  |  y
--------
1  |  0
2  |  0
3  |  3

Just use na.fill:

df_sample_join.na.fill(0, Seq("y"))
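As a minimal end-to-end sketch of this answer (assuming a local Spark 1.6 setup with a `SQLContext` named `sqlContext`, as in the question's environment):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Rebuild the question's DataFrames
val conf = new SparkConf().setAppName("na-fill-demo").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df_sample_x = sc.parallelize(Array(1, 2, 3)).toDF("x")
val df_sample_y = sc.parallelize(Array(3, 4, 5)).toDF("y")

val df_sample_join = df_sample_x
  .join(df_sample_y, df_sample_x("x") === df_sample_y("y"), "left_outer")

// Replace nulls in column "y" with 0; rows x = 1 and x = 2
// had no match, so their y becomes 0 instead of null
val filled = df_sample_join.na.fill(0, Seq("y"))
filled.show()
```

`na.fill` only touches the listed columns, so `x` is left untouched even if it were nullable.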

Try:

import org.apache.spark.sql.functions.{coalesce, lit}
val withReplacedNull = df_sample_join.withColumn("y", coalesce('y, lit(0)))

Test:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, lit}
import org.apache.spark.sql.types._

// The rows contain nulls, so the "y" field must be declared nullable
val list = List(Row("a", null), Row("b", null), Row("c", 1))
val rdd = sc.parallelize(list)
val schema = StructType(
    StructField("text", StringType, false) ::
    StructField("y", IntegerType, true) :: Nil)
val df = sqlContext.createDataFrame(rdd, schema)
val df1 = df.withColumn("y", coalesce('y, lit(0)))
df1.show()

You can fix the existing DataFrame like this:

import org.apache.spark.sql.functions.{when, lit}
val correctedDf = df_sample_join.withColumn("y", when($"y".isNull, lit(0)).otherwise($"y"))

Although T.Gawęda's answer also works, I think this is more readable.
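For completeness, the same replacement can also be expressed in SQL with `COALESCE` (a sketch, assuming the join result from the question is registered as a temporary table; `joined` is a name chosen here for illustration):

```scala
// Register the join result and apply COALESCE via the Spark 1.6 SQL API
df_sample_join.registerTempTable("joined")
val viaSql = sqlContext.sql("SELECT x, COALESCE(y, 0) AS y FROM joined")
viaSql.show()
```

All three approaches (`na.fill`, `coalesce`, `when`/`otherwise`) produce the same result here; `na.fill` is the most concise when you only need a constant default.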
