Suppose I have a PySpark DataFrame like this:
df = spark.createDataFrame([
    ("Jan", "a", "apple", 1),
    ("Jan", "a", "pear", 2),
    ("Jan", "b", "apple", 1),
    ("Feb", "c", "carrot", 3),
    ("Feb", "c", "apple", 1),
], ["month", "id", "product", "price"])
+-----+---+-------+-----+
|month| id|product|price|
+-----+---+-------+-----+
| Jan| a| apple| 1|
| Jan| a| pear| 2|
| Jan| b| apple| 1|
| Feb| c| carrot| 3|
| Feb| c| apple| 1|
+-----+---+-------+-----+
Now I want to group by month and product and aggregate each customer's spend as a percentage of their total. The desired output looks like this:
+-----+-------+-------+
|month|product|agg_val|
+-----+-------+-------+
| Jan| apple| 1.33|
| Jan| pear| 0.66|
| Feb| carrot| 0.75|
| Feb| apple| 0.25|
+-----+-------+-------+
where the agg_vals are calculated as 1.33 = 1/(1+2) + 1/1, 0.66 = 2/(1+2), 0.75 = 3/(3+1), and 0.25 = 1/(3+1).
Is there a way to do this within .agg()? I can't figure out how.
You can add a column with each customer's spend percentage first, then aggregate:
from pyspark.sql import functions as F, Window

# Compute each row's share of that customer's total spend for the month,
# then sum those shares per month and product
df2 = df.withColumn(
    'total',
    F.col('price') / F.sum('price').over(Window.partitionBy('id', 'month'))
).groupBy('month', 'product').agg(F.sum('total').alias('total'))
df2.show()
+-----+-------+------------------+
|month|product| total|
+-----+-------+------------------+
| Jan| apple|1.3333333333333333|
| Jan| pear|0.6666666666666666|
| Feb| apple| 0.25|
| Feb| carrot| 0.75|
+-----+-------+------------------+
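If you want two-decimal values named agg_val as in the question's desired output, one option is to round and rename the aggregated column. A minimal sketch using F.round (note that rounding yields 0.67 for Jan/pear, whereas the question shows the truncated value 0.66):

df2.select(
    'month',
    'product',
    # round to 2 decimals; alias to match the desired output schema
    F.round('total', 2).alias('agg_val')
).show()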