Given a Spark DataFrame (cat_sdf):
+--------+----------+----+
|category|date      |nums|
+--------+----------+----+
|A       |2005-01-01|1   |
|A       |2005-02-01|2   |
|A       |2005-01-03|3   |
|B       |2004-03-01|3   |
|B       |2004-04-04|1   |
|A       |2006-04-06|4   |
|B       |2007-01-10|1   |
+--------+----------+----+
I want to group by category and year, and then spread the monthly sum of nums for each category into one column per month. The desired result looks like:
+--------+----+-------+-------+-------+-------+---+-------+
|category|year|jan_num|feb_num|mar_num|apr_num|...|dec_num|
+--------+----+-------+-------+-------+-------+---+-------+
|A       |2005|4      |2      |0      |0      |...|0      |
|A       |2006|0      |0      |0      |4      |...|0      |
|B       |2004|0      |0      |3      |1      |...|0      |
|B       |2007|1      |0      |0      |0      |...|0      |
+--------+----+-------+-------+-------+-------+---+-------+
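For reference, cat_sdf can be recreated like this (a minimal sketch; it assumes an existing SparkSession named spark and parses the dates from strings):

from pyspark.sql import functions as F

cat_sdf = spark.createDataFrame(
    [('A', '2005-01-01', 1), ('A', '2005-02-01', 2), ('A', '2005-01-03', 3),
     ('B', '2004-03-01', 3), ('B', '2004-04-04', 1), ('A', '2006-04-06', 4),
     ('B', '2007-01-10', 1)],
    ['category', 'date', 'nums']
).withColumn('date', F.to_date('date'))  # cast the string dates to DateType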
My attempt:

from pyspark.sql.functions import when, year, month, sum
new_sdf = cat_sdf.groupBy('category', year('date')).agg(
    when(month('date') == 1, sum('nums')).alias('jan_num')
    .when(month('date') == 2, sum('nums')).alias('feb_num')
    ...
    .otherwise(0))

But it doesn't work.
Any suggestions on how to solve this?

Instead of a case when per month (the attempt above breaks because when() is wrapped around the aggregate and then chained after .alias(); a working case-when version is sketched at the end), you can group by category, year, and month, and then pivot the months into columns:
from pyspark.sql import functions as F

(cat_sdf
    .groupBy(
        F.col('category'),
        F.year('date').alias('year'),
        # 'MM_MMMM' -> '01_January', '02_February', ...: the numeric prefix
        # keeps the pivoted columns in calendar order
        F.date_format('date', 'MM_MMMM').alias('month')
    )
    .agg(F.sum('nums').alias('nums'))   # monthly total per category/year
    .groupBy('category', 'year')
    .pivot('month')                     # one column per observed month value
    .agg(F.sum('nums').alias('nums'))
    .fillna(0)                          # absent months -> 0 instead of null
    .orderBy('category', 'year')
    .show()
)
# Output
# +--------+----+----------+-----------+--------+--------+
# |category|year|01_January|02_February|03_March|04_April|
# +--------+----+----------+-----------+--------+--------+
# | A|2005| 4| 2| 0| 0|
# | A|2006| 0| 0| 0| 4|
# | B|2004| 0| 0| 3| 1|
# | B|2007| 1| 0| 0| 0|
# +--------+----+----------+-----------+--------+--------+
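Note that pivot only creates columns for the month values that actually occur in the data, which is why the output above stops at 04_April. To get all twelve columns with the jan_num ... dec_num headers from the question, you can pass an explicit list of values to pivot. A minimal sketch (the header names are built with date_format(..., 'MMM'), whose month abbreviations depend on the JVM locale):

from pyspark.sql import functions as F

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

new_sdf = (cat_sdf
    .groupBy(
        F.col('category'),
        F.year('date').alias('year'),
        # 'MMM' -> 'Jan', 'Feb', ...; lowercase + '_num' matches the wanted headers
        F.concat(F.lower(F.date_format('date', 'MMM')), F.lit('_num')).alias('month')
    )
    .agg(F.sum('nums').alias('nums'))
    .groupBy('category', 'year')
    # an explicit value list forces one column per month, in calendar order
    .pivot('month', [m + '_num' for m in months])
    .agg(F.sum('nums'))
    .fillna(0)
    .orderBy('category', 'year')
)

Alternatively, the original case-when idea does work once each condition is moved inside its own aggregate, one sum(when(...)) expression per month; a sketch:

from pyspark.sql import functions as F

months = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
          'jul', 'aug', 'sep', 'oct', 'nov', 'dec']

new_sdf = (cat_sdf
    .groupBy('category', F.year('date').alias('year'))
    # month('date') returns 1..12, so month i+1 maps to abbreviation months[i]
    .agg(*[
        F.sum(F.when(F.month('date') == i + 1, F.col('nums')).otherwise(0)).alias(m + '_num')
        for i, m in enumerate(months)
    ])
    .orderBy('category', 'year')
)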