First, I wrote this:
df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)) AS percentage_2019,
           cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)) AS percentage_2020,
           cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)) AS percentage_2021
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pi.head(10)
As a result, I got rows with NULL values:
[Row(country_code='AD', age_group='Y0_14', percentage_2019=Decimal('13.90'), percentage_2020=None, percentage_2021=None),
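(Presumably the raw 2020/2021 strings for those rows contain something other than a plain number, for example Eurostat's ':' placeholder for missing data, which '[a-z]' does not strip, so the cast returns NULL. A tiny made-up illustration of that behaviour:)

from pyspark.sql.functions import regexp_replace

# Hypothetical raw values: a clean number, a number with a letter flag, and a ':' placeholder.
demo = spark.createDataFrame([("13.9 ",), ("16.8 e",), (": ",)], ["percentage_2020"])
demo.select(
    regexp_replace("percentage_2020", "[a-z]", "").cast("decimal(4,2)").alias("casted")
).show()
# 13.90, 16.80, and NULL -- anything still non-numeric after the replace casts to NULL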
So I modified the code to use COALESCE:

df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           COALESCE(cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2019,
           COALESCE(cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2020,
           COALESCE(cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2021
    FROM raw_population
""")

and the NULL values disappeared, replaced by 0.00:
[Row(country_code='AD', age_group='Y0_14', percentage_2019=Decimal('13.90'), percentage_2020=Decimal('0.00'), percentage_2021=Decimal('0.00')),
Row(country_code='AL', age_group='Y0_14', percentage_2019=Decimal('17.20'), percentage_2020=Decimal('16.80'), percentage_2021=Decimal('16.50')),
Row(country_code='AM', age_group='Y0_14', percentage_2019=Decimal('20.20'), percentage_2020=Decimal('20.30'), percentage_2021=Decimal('0.00')),
Row(country_code='AT', age_group='Y0_14', percentage_2019=Decimal('14.40'), percentage_2020=Decimal('14.40'), percentage_2021=Decimal('14.40')),
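(COALESCE returns its first non-NULL argument, so COALESCE(cast(...), 0) turns a failed cast into 0 instead of NULL. The same cleanup can be written with the DataFrame API; a sketch with a hypothetical helper clean_pct, assuming the raw_population temp view shown further down is already registered:)

from pyspark.sql.functions import coalesce, lit, regexp_replace

def clean_pct(name):
    # Strip letter flags, cast to decimal, and fall back to 0 when the cast yields NULL.
    return coalesce(regexp_replace(name, "[a-z]", "").cast("decimal(4,2)"), lit(0)).alias(name)

df_pi_api = spark.table("raw_population").select(
    "country_code", "age_group",
    clean_pct("percentage_2019"), clean_pct("percentage_2020"), clean_pct("percentage_2021"),
)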
But when I tried to add the three columns percentage_2019, percentage_2020 and percentage_2021:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pivot.head(10)
I got NULL again:
[Row(country_code='AD', age_group='Y0_14', sum_percentage_years=None),
Row(country_code='AL', age_group='Y0_14', sum_percentage_years=50.5),
Row(country_code='AM', age_group='Y0_14', sum_percentage_years=None),
Row(country_code='AT', age_group='Y0_14', sum_percentage_years=43.2),
I am completely confused by these values. Can anyone help me? How can percentage_2019 = 13.90 + percentage_2020 = 0.00 + percentage_2021 = 0.00 add up to NULL?
The mistake I made: in the first step, I read and imported the data from blob storage and created a DataFrame and a temporary view as follows.
from pyspark.sql.functions import col, regexp_replace, split

df_raw_population = spark.read.csv("/mnt/covidreportingpro1/raw/population", sep=r'\t', header=True)
df_raw_population = df_raw_population \
    .withColumn('age_group', regexp_replace(split(df_raw_population['indic_de,geo\time'], ',')[0], 'PC_', '')) \
    .withColumn('country_code', split(df_raw_population['indic_de,geo\time'], ',')[1])
df_raw_population = df_raw_population.select(
    col("country_code").alias("country_code"),
    col("age_group").alias("age_group"),
    col("2019 ").alias("percentage_2019"),
    col("2020 ").alias("percentage_2020"),
    col("2021 ").alias("percentage_2021"),
)
df_raw_population.createOrReplaceTempView("raw_population")
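The indic_de,geo\time column holds a combined key such as 'PC_Y0_14,AD', which is why the split/regexp_replace calls above pull out the age group and the country code. A tiny made-up illustration of that logic (using a simpler column name):

from pyspark.sql.functions import regexp_replace, split

demo = spark.createDataFrame([("PC_Y0_14,AD",)], ["key"])
demo.select(
    regexp_replace(split(demo["key"], ",")[0], "PC_", "").alias("age_group"),
    split(demo["key"], ",")[1].alias("country_code"),
).show()
# age_group = 'Y0_14', country_code = 'AD'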
Now, on this raw_population view, I cast percentage_2019, percentage_2020 and percentage_2021 from string to decimal. The code is below:
df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           COALESCE(cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2019,
           COALESCE(cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2020,
           COALESCE(cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2021
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pi.createOrReplaceTempView("raw_population_pi")
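A quick sanity check I would run on the new view (not part of the original notebook) is to confirm that the columns are now decimal and contain no NULLs:

df_raw_population_pi.printSchema()   # the percentage_* columns should now be decimal(4,2)
spark.sql("""
    SELECT count(*) AS null_rows
    FROM raw_population_pi
    WHERE percentage_2019 IS NULL OR percentage_2020 IS NULL OR percentage_2021 IS NULL
""").show()   # expect 0 after the COALESCE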
Here I created a new DataFrame and a new temp view, df_raw_population_pi and raw_population_pi.
The silly thing I did in the next step: when adding those percentage columns, I selected FROM raw_population instead of FROM raw_population_pi. Look below:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population
    WHERE length(country_code) = 2
""").groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code")
df_raw_population_pivot.createOrReplaceTempView("raw_population_pivot")
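Because this reads FROM raw_population, where some of the percentage values are still missing, and any arithmetic expression with a NULL operand evaluates to NULL in Spark SQL, the whole sum comes out NULL for those rows. A one-line illustration with made-up values:

spark.sql("SELECT 13.90 + 0.00 + 0.00 AS all_present, 13.90 + CAST(NULL AS decimal(4,2)) + 0.00 AS one_null").show()
# all_present = 13.90, one_null = NULL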
Instead, I should have written:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population_pi
    WHERE length(country_code) = 2
""").groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code")
df_raw_population_pivot.createOrReplaceTempView("raw_population_pivot")
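For reference, the chained part turns the long table into one row per country with one column per age group: groupBy("country_code") groups the rows, pivot("age_group") turns each distinct age group into a column, and sum("sum_percentage_years") fills the cells. A minimal sketch of the same pattern on made-up data:

demo = spark.createDataFrame(
    [("AL", "Y0_14", 50.5), ("AL", "Y15_24", 35.1), ("AT", "Y0_14", 43.2)],   # made-up rows
    ["country_code", "age_group", "sum_percentage_years"],
)
demo.groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code").show()
# AL -> Y0_14 = 50.5, Y15_24 = 35.1 ; AT -> Y0_14 = 43.2, Y15_24 = null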
I wasted a lot of time on this silly mistake. (Note the FROM clause in each query above.)