First, I wrote this:
df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)) AS percentage_2019,
           cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)) AS percentage_2020,
           cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)) AS percentage_2021
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pi.head(10)
As a result, I got rows with NULL values:
[Row(country_code='AD', age_group='Y0_14', percentage_2019=Decimal('13.90'), percentage_2020=None, percentage_2021=None),
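(Presumably the raw 2020/2021 strings for those rows contain something other than a plain number, for example Eurostat's ':' placeholder for missing data, which '[a-z]' does not strip, so the cast returns NULL. A tiny made-up illustration of that behaviour:)

from pyspark.sql.functions import regexp_replace

# Hypothetical raw values: a clean number, a number with a letter flag, and a ':' placeholder.
demo = spark.createDataFrame([("13.9 ",), ("16.8 e",), (": ",)], ["percentage_2020"])
demo.select(
    regexp_replace("percentage_2020", "[a-z]", "").cast("decimal(4,2)").alias("casted")
).show()
# 13.90, 16.80, and NULL -- anything still non-numeric after the replace casts to NULL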
So I modified the code to use COALESCE:

df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           COALESCE(cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2019,
           COALESCE(cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2020,
           COALESCE(cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2021
    FROM raw_population
""")

and the NULL values disappeared, replaced by 0.00:
[Row(country_code='AD', age_group='Y0_14', percentage_2019=Decimal('13.90'), percentage_2020=Decimal('0.00'), percentage_2021=Decimal('0.00')),
Row(country_code='AL', age_group='Y0_14', percentage_2019=Decimal('17.20'), percentage_2020=Decimal('16.80'), percentage_2021=Decimal('16.50')),
Row(country_code='AM', age_group='Y0_14', percentage_2019=Decimal('20.20'), percentage_2020=Decimal('20.30'), percentage_2021=Decimal('0.00')),
Row(country_code='AT', age_group='Y0_14', percentage_2019=Decimal('14.40'), percentage_2020=Decimal('14.40'), percentage_2021=Decimal('14.40')),
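(COALESCE returns its first non-NULL argument, so COALESCE(cast(...), 0) turns a failed cast into 0 instead of NULL. The same cleanup can be written with the DataFrame API; a sketch with a hypothetical helper clean_pct, assuming the raw_population temp view shown further down is already registered:)

from pyspark.sql.functions import coalesce, lit, regexp_replace

def clean_pct(name):
    # Strip letter flags, cast to decimal, and fall back to 0 when the cast yields NULL.
    return coalesce(regexp_replace(name, "[a-z]", "").cast("decimal(4,2)"), lit(0)).alias(name)

df_pi_api = spark.table("raw_population").select(
    "country_code", "age_group",
    clean_pct("percentage_2019"), clean_pct("percentage_2020"), clean_pct("percentage_2021"),
)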
But when I tried to add the three columns percentage_2019, percentage_2020 and percentage_2021:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pivot.head(10)
I got NULL again:
[Row(country_code='AD', age_group='Y0_14', sum_percentage_years=None),
Row(country_code='AL', age_group='Y0_14', sum_percentage_years=50.5),
Row(country_code='AM', age_group='Y0_14', sum_percentage_years=None),
Row(country_code='AT', age_group='Y0_14', sum_percentage_years=43.2),
I am completely confused by these values. Can anyone help me? How can percentage_2019 = 13.90 + percentage_2020 = 0.00 + percentage_2021 = 0.00 add up to NULL?
The mistake I made: in the first step, I read and imported the data from blob storage and created a DataFrame and a temporary view as follows.
from pyspark.sql.functions import col, regexp_replace, split

df_raw_population = spark.read.csv("/mnt/covidreportingpro1/raw/population", sep=r'\t', header=True)
df_raw_population = df_raw_population \
    .withColumn('age_group', regexp_replace(split(df_raw_population['indic_de,geo\time'], ',')[0], 'PC_', '')) \
    .withColumn('country_code', split(df_raw_population['indic_de,geo\time'], ',')[1])
df_raw_population = df_raw_population.select(
    col("country_code").alias("country_code"),
    col("age_group").alias("age_group"),
    col("2019 ").alias("percentage_2019"),
    col("2020 ").alias("percentage_2020"),
    col("2021 ").alias("percentage_2021"),
)
df_raw_population.createOrReplaceTempView("raw_population")
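The indic_de,geo\time column holds a combined key such as 'PC_Y0_14,AD', which is why the split/regexp_replace calls above pull out the age group and the country code. A tiny made-up illustration of that logic (using a simpler column name):

from pyspark.sql.functions import regexp_replace, split

demo = spark.createDataFrame([("PC_Y0_14,AD",)], ["key"])
demo.select(
    regexp_replace(split(demo["key"], ",")[0], "PC_", "").alias("age_group"),
    split(demo["key"], ",")[1].alias("country_code"),
).show()
# age_group = 'Y0_14', country_code = 'AD'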
Now, on this raw_population view, I cast percentage_2019, percentage_2020 and percentage_2021 from string to decimal. The code is below:
df_raw_population_pi = spark.sql("""
    SELECT country_code, age_group,
           COALESCE(cast(regexp_replace(percentage_2019, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2019,
           COALESCE(cast(regexp_replace(percentage_2020, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2020,
           COALESCE(cast(regexp_replace(percentage_2021, '[a-z]', '') AS decimal(4,2)), 0) AS percentage_2021
    FROM raw_population
    WHERE length(country_code) = 2
""")
df_raw_population_pi.createOrReplaceTempView("raw_population_pi")
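A quick sanity check I would run on the new view (not part of the original notebook) is to confirm that the columns are now decimal and contain no NULLs:

df_raw_population_pi.printSchema()   # the percentage_* columns should now be decimal(4,2)
spark.sql("""
    SELECT count(*) AS null_rows
    FROM raw_population_pi
    WHERE percentage_2019 IS NULL OR percentage_2020 IS NULL OR percentage_2021 IS NULL
""").show()   # expect 0 after the COALESCE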
Here I created a new DataFrame and a new temp view, df_raw_population_pi and raw_population_pi.
The silly thing I did in the next step: when adding those percentage columns, I selected FROM raw_population instead of FROM raw_population_pi. Look below:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population
    WHERE length(country_code) = 2
""").groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code")
df_raw_population_pivot.createOrReplaceTempView("raw_population_pivot")
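Because this reads FROM raw_population, where some of the percentage values are still missing, and any arithmetic expression with a NULL operand evaluates to NULL in Spark SQL, the whole sum comes out NULL for those rows. A one-line illustration with made-up values:

spark.sql("SELECT 13.90 + 0.00 + 0.00 AS all_present, 13.90 + CAST(NULL AS decimal(4,2)) + 0.00 AS one_null").show()
# all_present = 13.90, one_null = NULL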
Instead, I should have written:
df_raw_population_pivot = spark.sql("""
    SELECT country_code, age_group,
           (percentage_2019 + percentage_2020 + percentage_2021) AS sum_percentage_years
    FROM raw_population_pi
    WHERE length(country_code) = 2
""").groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code")
df_raw_population_pivot.createOrReplaceTempView("raw_population_pivot")
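For reference, the chained part turns the long table into one row per country with one column per age group: groupBy("country_code") groups the rows, pivot("age_group") turns each distinct age group into a column, and sum("sum_percentage_years") fills the cells. A minimal sketch of the same pattern on made-up data:

demo = spark.createDataFrame(
    [("AL", "Y0_14", 50.5), ("AL", "Y15_24", 35.1), ("AT", "Y0_14", 43.2)],   # made-up rows
    ["country_code", "age_group", "sum_percentage_years"],
)
demo.groupBy("country_code").pivot("age_group").sum("sum_percentage_years").orderBy("country_code").show()
# AL -> Y0_14 = 50.5, Y15_24 = 35.1 ; AT -> Y0_14 = 43.2, Y15_24 = null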
I wasted a lot of time on this silly mistake. (Note the FROM clause in each query above.)