没有替换Scala Dataframe NA值



我正在尝试使用Scala在数据帧上进行转换。

在此,我试图删除存在Null值的country_region,并且我试图用0填充NA的剩余列活动,确认,死亡和恢复。

但是,我注意到,当我显示数据框时,"活动"、"已确认"、"死亡"one_answers"已恢复"列中仍然有空值。

val JHU_COVID_19: DataFrame = spark.read
.format("snowflake")
.options(options)
.option("dbtable", "JHU_COVID_19")
.load()
val covid_19_grp = JHU_COVID_19.groupBy("COUNTRY_REGION","PROVINCE_STATE","CASE_TYPE","DATE")
.agg(count("CASE_TYPE").as("count"),sum("DIFFERENCE").as("sum"))

//display(covid_19_dataset)
val covid_19_pivot = covid_19_grp.groupBy("COUNTRY_REGION","PROVINCE_STATE","DATE")
.pivot("CASE_TYPE")
.agg(min("sum"))
//cleaning the data
covid_19_pivot.na.drop(Seq("COUNTRY_REGION")).show(false)
covid_19_pivot.na.fill(0,Array("Active","Confirmed","Deaths","Recovered"))
//Inner join GOOG_GLOBAL_MOBILITY_REPORT with the covid_19_pivot
val final_COVID_19_PREPARED = covid_19_pivot.join(mobility_dataset,Seq("COUNTRY_REGION","PROVINCE_STATE","DATE"))
display(final_COVID_19_PREPARED)

然后我尝试了另一种方法,我将变量分配给中间转换,这有效。

这是否意味着我总是需要分配变量?我怎么能像python一样在数据框架上进行转换而不分配中间变量?

val JHU_COVID_19: DataFrame = spark.read
.format("snowflake")
.options(options)
.option("dbtable", "JHU_COVID_19")
.load()
val covid_19_grp = JHU_COVID_19.groupBy("COUNTRY_REGION","PROVINCE_STATE","CASE_TYPE","DATE")
.agg(count("CASE_TYPE").as("count"),sum("DIFFERENCE").as("sum"))

//display(covid_19_dataset)
val covid_19_pivot = covid_19_grp.groupBy("COUNTRY_REGION","PROVINCE_STATE","DATE")
.pivot("CASE_TYPE")
.agg(min("sum"))
//cleaning the data
val test = covid_19_pivot.na.drop(Seq("COUNTRY_REGION"))
val test2 = test.na.fill(0,Array("Active","Confirmed","Deaths","Recovered"))
//Inner join GOOG_GLOBAL_MOBILITY_REPORT with the covid_19_pivot
val final_COVID_19_PREPARED = test2.join(mobility_dataset,Seq("COUNTRY_REGION","PROVINCE_STATE","DATE"))
display(final_COVID_19_PREPARED)

spark中的Dataframe在设计上是不可变的。当你在一个数据框架上执行任何转换时,如果你没有将它分配给一个不同的数据框架变量,那么转换将会发生,但你将无法在现有的数据框架变量中看到它。

如果你不想为每个转换创建新的数据框架变量,你可以链接转换操作,一旦你完成了所有的转换,将其分配给一个新的数据框架变量,你将能够合并所有的更改。

你的代码到目前为止使用单独的变量工作。

val covid_19_pivot = covid_19_grp.groupBy("COUNTRY_REGION","PROVINCE_STATE","DATE")
.pivot("CASE_TYPE")
.agg(min("sum"))
//cleaning the data
val test = covid_19_pivot.na.drop(Seq("COUNTRY_REGION"))
val test2 = test.na.fill(0,Array("Active","Confirmed","Deaths","Recovered"))

修改操作链,不创建新变量

val covid_19_pivot = covid_19_grp.groupBy("COUNTRY_REGION","PROVINCE_STATE","DATE")
.pivot("CASE_TYPE")
.agg(min("sum"))
.na.drop(Seq("COUNTRY_REGION"))
.na.fill(0,Array("Active","Confirmed","Deaths","Recovered"))

最新更新