Group by two columns and aggregate as a percentage of one column



Suppose I have a PySpark DataFrame like the following:

df = spark.createDataFrame([
("Jan", "a", "apple", 1),
("Jan", "a", "pear", 2),
("Jan", "b", "apple", 1),
("Feb", "c", "carrot", 3),
("Feb", "c", "apple", 1),
], ["month", id", "product", "price"])
+-----+---+-------+-----+
|month| id|product|price|
+-----+---+-------+-----+
|  Jan|  a|  apple|    1|
|  Jan|  a|   pear|    2|
|  Jan|  b|  apple|    1|
|  Feb|  c| carrot|    3|
|  Feb|  c|  apple|    1|
+-----+---+-------+-----+

Now I want to group by month and product and sum up each customer's percentage of their spend. The desired output looks like this:

+-----+-------+-------+
|month|product|agg_val|
+-----+-------+-------+
|  Jan|  apple|   1.33|
|  Jan|   pear|   0.66|
|  Feb| carrot|   0.75|
|  Feb|  apple|   0.25|
+-----+-------+-------+

where the agg_val values are computed as 1.33 = 1/(1+2) + 1/1, 0.66 = 2/(1+2), 0.75 = 3/(3+1), and 0.25 = 1/(3+1).
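
As a quick sanity check of that arithmetic (plain Python, only to spell out the intended calculation):

jan_apple  = 1/(1+2) + 1/1   # customer a: 1 out of 3, plus customer b: 1 out of 1 -> 1.33
jan_pear   = 2/(1+2)         # customer a: 2 out of 3 -> 0.66...
feb_carrot = 3/(3+1)         # customer c: 3 out of 4 -> 0.75
feb_apple  = 1/(3+1)         # customer c: 1 out of 4 -> 0.25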

Is there a way to do this in .agg()? I can't figure out how.

You can add a column with each customer's percentage and then aggregate:

from pyspark.sql import functions as F, Window

# For each row, compute price as a fraction of that customer's total spend in the month,
# then sum those fractions per (month, product).
df2 = df.withColumn(
    'total',
    F.col('price') / F.sum('price').over(Window.partitionBy('id', 'month'))
).groupBy('month', 'product').agg(F.sum('total').alias('total'))
df2.show()
+-----+-------+------------------+
|month|product|             total|
+-----+-------+------------------+
|  Jan|  apple|1.3333333333333333|
|  Jan|   pear|0.6666666666666666|
|  Feb|  apple|              0.25|
|  Feb| carrot|              0.75|
+-----+-------+------------------+
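
To match the two-decimal agg_val column in the desired output, you could optionally round and rename the result (a small finishing touch, not part of the answer above; note that rounding 0.666... gives 0.67, whereas the desired output shows a truncated 0.66):

df2.select(
    'month', 'product',
    F.round('total', 2).alias('agg_val')   # round the summed shares to two decimals
).show()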
