PySpark: get the row with the maximum value from a multi-column grouping



I'd be glad to get some help :-) I have the following DataFrame:

Type | Number |    Date    | Value |
------------------------------------
   A |      1 | 2022-10-01 |     5 |
   A |      2 | 2022-10-01 |     8 |
   A |      3 | 2022-11-23 |     4 |
   B |      1 | 2022-02-02 |     1 |
   B |      2 | 2022-02-04 |     9 |
   B |      3 | 2022-02-04 |     3 |
   B |      4 | 2022-02-04 |     1 |

The result should be grouped by Type and Date, and Value should be taken from the row whose Number is the maximum within the group:

Type | Number |    Date    | Value |
------------------------------------
   A |      2 | 2022-10-01 |     8 |
   A |      3 | 2022-11-23 |     4 |
   B |      1 | 2022-02-02 |     1 |
   B |      4 | 2022-02-04 |     1 |
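
For intuition, the desired transformation is a per-(Type, Date) argmax over Number. A minimal plain-Python sketch of that logic (the names rows and best are illustrative, not part of the question):

```python
# Sample rows mirroring the DataFrame above: (Type, Number, Date, Value)
rows = [
    ("A", 1, "2022-10-01", 5),
    ("A", 2, "2022-10-01", 8),
    ("A", 3, "2022-11-23", 4),
    ("B", 1, "2022-02-02", 1),
    ("B", 2, "2022-02-04", 9),
    ("B", 3, "2022-02-04", 3),
    ("B", 4, "2022-02-04", 1),
]

# For each (Type, Date) group, keep the row with the largest Number.
best = {}
for typ, number, date, value in rows:
    key = (typ, date)
    if key not in best or number > best[key][1]:
        best[key] = (typ, number, date, value)

result = sorted(best.values())
# → [('A', 2, '2022-10-01', 8), ('A', 3, '2022-11-23', 4),
#    ('B', 1, '2022-02-02', 1), ('B', 4, '2022-02-04', 1)]
```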

I tried the following (in several variations, also with groupBy()), but without success:

from pyspark.sql import functions as F
from pyspark.sql import Window

# Attach the group-wise maximum of Number to every row via a window,
# then keep only the rows where Number equals that maximum
w = Window.partitionBy("Type", "Date")
df_result = (
    df.withColumn("MaxNumber", F.max("Number").over(w))
    .where(F.col("Number") == F.col("MaxNumber"))
    .drop("MaxNumber")
)
df_result.display()

Thanks for your help.

Your code works fine:

df.withColumn("MaxNumber", F.max("Number").over(w)).where(
F.col("Number") == F.col("MaxNumber")
).show()
+----+------+----------+-----+---------+
|Type|Number|      Date|Value|MaxNumber|
+----+------+----------+-----+---------+
|   A|     2|2022-10-01|    8|        2|
|   A|     3|2022-11-23|    4|        3|
|   B|     1|2022-02-02|    1|        1|
|   B|     4|2022-02-04|    1|        4|
+----+------+----------+-----+---------+

I would do it slightly differently, using row_number instead:

from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy("Type", "Date").orderBy(F.col("Number").desc())
df.withColumn("MaxNumber", F.row_number().over(w)).show()
+----+------+----------+-----+---------+
|Type|Number|      Date|Value|MaxNumber|
+----+------+----------+-----+---------+
|   A|     2|2022-10-01|    8|        1|  <--
|   A|     1|2022-10-01|    5|        2|
|   A|     3|2022-11-23|    4|        1|  <--
|   B|     1|2022-02-02|    1|        1|  <--
|   B|     4|2022-02-04|    1|        1|  <--
|   B|     3|2022-02-04|    3|        2|
|   B|     2|2022-02-04|    9|        3|
+----+------+----------+-----+---------+

Then filter with where("MaxNumber = 1"):

df.withColumn("MaxNumber", F.row_number().over(w)).where("MaxNumber = 1").drop(
"MaxNumber"
).show()
+----+------+----------+-----+
|Type|Number|      Date|Value|
+----+------+----------+-----+
|   A|     2|2022-10-01|    8|
|   A|     3|2022-11-23|    4|
|   B|     1|2022-02-02|    1|
|   B|     4|2022-02-04|    1|
+----+------+----------+-----+
