I'd be glad to get some help :-) I have the following dataframe:
Type | Number | Date | Value |
------------------------------------
A | 1 | 2022-10-01 | 5 |
A | 2 | 2022-10-01 | 8 |
A | 3 | 2022-11-23 | 4 |
B | 1 | 2022-02-02 | 1 |
B | 2 | 2022-02-04 | 9 |
B | 3 | 2022-02-04 | 3 |
B | 4 | 2022-02-04 | 1 |
The result should be grouped by Type and Date, and Value should be taken from the row with the maximum Number (within each group):
Type | Number | Date | Value |
------------------------------------
A | 2 | 2022-10-01 | 8 |
A | 3 | 2022-11-23 | 4 |
B | 1 | 2022-02-02 | 1 |
B | 4 | 2022-02-04 | 1 |
I tried the following (in several variations; also with groupBy()), but without success:
from pyspark.sql.functions import *
from pyspark.sql import Window
w = Window.partitionBy("Type", "Date")
df_result = (
df.withColumn("MaxNumber", max("Number").over(w))
.where(col("Number") == col("MaxNumber"))
.drop("MaxNumber")
)
df_result.display()
Thanks for your help.
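For reference, a minimal sketch to reproduce your sample data (this setup is assumed, not part of the original post; Date is kept as a string for simplicity):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data copied from the question
df = spark.createDataFrame(
    [
        ("A", 1, "2022-10-01", 5),
        ("A", 2, "2022-10-01", 8),
        ("A", 3, "2022-11-23", 4),
        ("B", 1, "2022-02-02", 1),
        ("B", 2, "2022-02-04", 9),
        ("B", 3, "2022-02-04", 3),
        ("B", 4, "2022-02-04", 1),
    ],
    ["Type", "Number", "Date", "Value"],
)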
Your code works fine:
df.withColumn("MaxNumber", F.max("Number").over(w)).where(
F.col("Number") == F.col("MaxNumber")
).show()
+----+------+----------+-----+---------+
|Type|Number| Date|Value|MaxNumber|
+----+------+----------+-----+---------+
| A| 2|2022-10-01| 8| 2|
| A| 3|2022-11-23| 4| 3|
| B| 1|2022-02-02| 1| 1|
| B| 4|2022-02-04| 1| 4|
+----+------+----------+-----+---------+
I would do it a bit differently, using row_number instead:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy("Type", "Date").orderBy(F.col("Number").desc())
df.withColumn("MaxNumber", F.row_number().over(w)).show()
+----+------+----------+-----+---------+
|Type|Number| Date|Value|MaxNumber|
+----+------+----------+-----+---------+
| A| 2|2022-10-01| 8| 1| <--
| A| 1|2022-10-01| 5| 2|
| A| 3|2022-11-23| 4| 1| <--
| B| 1|2022-02-02| 1| 1| <--
| B| 4|2022-02-04| 1| 1| <--
| B| 3|2022-02-04| 3| 2|
| B| 2|2022-02-04| 9| 3|
+----+------+----------+-----+---------+
Then filter with where("MaxNumber = 1"):
df.withColumn("MaxNumber", F.row_number().over(w)).where("MaxNumber = 1").drop(
"MaxNumber"
).show()
+----+------+----------+-----+
|Type|Number| Date|Value|
+----+------+----------+-----+
| A| 2|2022-10-01| 8|
| A| 3|2022-11-23| 4|
| B| 1|2022-02-02| 1|
| B| 4|2022-02-04| 1|
+----+------+----------+-----+
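Note the difference in tie handling: max-over-window plus filter keeps every row that shares the maximum Number in a group, while row_number keeps exactly one (arbitrary among ties unless you add a tiebreaker to the orderBy).

If you prefer to avoid a window entirely, a sketch of an alternative using groupBy and the max of a struct (structs compare field by field, so the row with the largest Number wins; column names as in the question):

from pyspark.sql import functions as F

# Aggregate per (Type, Date), keeping Number and Value of the max-Number row
df_alt = (
    df.groupBy("Type", "Date")
    .agg(F.max(F.struct("Number", "Value")).alias("m"))
    .select(
        "Type",
        F.col("m.Number").alias("Number"),
        "Date",
        F.col("m.Value").alias("Value"),
    )
)
df_alt.show()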