我有一个数据集df如下:
ID date class
1 2020/01/02 [math,english]
1 2020/01/03 [math,english]
1 2020/01/04 [math,english]
2 2020/01/02 [math,english]
2 2020/01/03 [math,english,art]
2 2020/01/04 [math,english]
2 2020/01/05 [math,english,art]
2 2020/01/06 [math,art]
2 2020/01/07 [math,art]
2 2020/01/08 [math,english,art]
我的当前代码是:
df.withColumn("c_order", rank()
.over(Window.partitionBy("ID","date")
.orderBy("class")))
我还尝试了dense_rank()和row_number(),但它们都不能提供所需的输出。
df.withColumn("c_order", dense_rank()
.over(Window.partitionBy("ID","date")
.orderBy("class")))
df.withColumn("c_order", row_number()
.over(Window.partitionBy("ID","date")
.orderBy("class")))
当前输出如下所示:
ID date class c_order
1 2020/01/02 [math,english] 1
1 2020/01/03 [math,english] 1
1 2020/01/04 [math,english] 1
2 2020/01/02 [math,english] 1
2 2020/01/03 [math,english,art] 1
2 2020/01/04 [math,english] 1
2 2020/01/05 [math,english,art] 1
2 2020/01/06 [math,art] 1
2 2020/01/07 [math,art] 1
2 2020/01/08 [math,english,art] 1
我希望输出如下
ID date class c_order
1 2020/01/02 [math,english] 1
1 2020/01/03 [math,english] 1
1 2020/01/04 [math,english] 1
2 2020/01/02 [math,english] 1
2 2020/01/03 [math,english,art] 2
2 2020/01/04 [math,english] 3
2 2020/01/05 [math,english,art] 4
2 2020/01/06 [math,art] 5
2 2020/01/07 [math,art] 5
2 2020/01/08 [math,english,art] 6
顺序只在类改变时增加。知道我哪里做错了吗?
谢谢!
你不能只做一个排名。您需要与前一行(使用lag
)进行比较,以检查类何时发生了更改。
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'diff',
F.coalesce(
F.col('class') != F.lag('class').over(Window.partitionBy('ID').orderBy('date')),
F.lit(False)
).cast('int')
).withColumn(
'c_order',
F.sum('diff').over(Window.partitionBy('ID').orderBy('date')) + 1
)
df2.show()
+---+----------+------------------+----+-------+
| ID| date| class|diff|c_order|
+---+----------+------------------+----+-------+
| 1|2020/01/02| [math,english]| 0| 1|
| 1|2020/01/03| [math,english]| 0| 1|
| 1|2020/01/04| [math,english]| 0| 1|
| 2|2020/01/02| [math,english]| 0| 1|
| 2|2020/01/03|[math,english,art]| 1| 2|
| 2|2020/01/04| [math,english]| 1| 3|
| 2|2020/01/05|[math,english,art]| 1| 4|
| 2|2020/01/06| [math,art]| 1| 5|
| 2|2020/01/07| [math,art]| 0| 5|
| 2|2020/01/08|[math,english,art]| 1| 6|
+---+----------+------------------+----+-------+