pyspark groupBy case when



给定spark数据帧(cat_sdf)

+-------------+---------------+---------------+
|category     |date           |nums           |
+-------------+---------------+---------------+
|A            |2005-01-01     |1              |
|A            |2005-02-01     |2              |
|A            |2005-01-03     |3              |
|B            |2004-03-01     |3              |
|B            |2004-04-04     |1              |
|A            |2006-04-06     |4              |
|B            |2007-01-10     |1              |
+-------------+---------------+---------------+

我想按类别和年份分组,然后展开每个月类别的数量总和。

结果就像

+-----------+-------+--------+---------+--------+---------+----+--------+
|category   |year   |jan_num |feb_num  |mar_num |apr_num  | ...|dec_num |
+-----------+-------+--------+---------+--------+---------+----+--------+
|A          |2005   |4       |2        |0       |0        | ...| 0      |
|A          |2006   |0       |0        |0       |4        | ...| 0      |
|B          |2004   |0       |0        |3       |1        | ...| 0      |
|B          |2007   |1       |0        |0       |0        | ...| 0      |
+-----------+-------+--------+---------+--------+---------+----+--------+
from pyspark.sql.functions import when, year, month, sum
new_sdf = cat_sdf.groupBy('category', year('date')).agg(when(month('date') == 1, sum('nums')).alias('jan_num').
when(month('date') == 2, sum('nums')).alias('feb_num').
...otherwise(0)

但是行不通。

关于如何解决这个问题有什么建议吗?

代替每个月使用case when,您可以使用groupbycategoryyearmonth,然后将月份pivot到列中

from pyspark.sql import functions as F
(df
.groupBy(
F.col('category'),
F.year('date').alias('year'),
F.date_format('date', 'MM_MMMM').alias('month')
)
.agg(F.sum('nums').alias('nums'))
.groupBy('category', 'year')
.pivot('month')
.agg(F.sum('nums').alias('nums'))
.fillna(0)

.orderBy('category', 'year')
.show()
)
# Output
# +--------+----+----------+-----------+--------+--------+
# |category|year|01_January|02_February|03_March|04_April|
# +--------+----+----------+-----------+--------+--------+
# |       A|2005|         4|          2|       0|       0|
# |       A|2006|         0|          0|       0|       4|
# |       B|2004|         0|          0|       3|       1|
# |       B|2007|         1|          0|       0|       0|
# +--------+----+----------+-----------+--------+--------+

最新更新