I have a PySpark dataframe -
df1 = spark.createDataFrame([
("u1", 10),
("u1", 20),
("u2", 10),
("u2", 10),
("u2", 30),
],
['user_id', 'var1'])
df1.printSchema()
df1.show(truncate=False)
which looks like
root
 |-- user_id: string (nullable = true)
 |-- var1: long (nullable = true)
+-------+----+
|user_id|var1|
+-------+----+
|u1 |10 |
|u1 |20 |
|u2 |10 |
|u2 |10 |
|u2 |30 |
+-------+----+
I want to assign a row index such that the index restarts for each group, with user_id sorted in ascending order and var1 sorted in ascending order within each group.
The desired output should look like -
+-------+----+-----+
|user_id|var1|order|
+-------+----+-----+
|u1 |10 | 1|
|u1 |20 | 2|
|u2 |10 | 1|
|u2 |10 | 2|
|u2 |30 | 3|
+-------+----+-----+
How can I achieve this?
This is just a row_number operation:
from pyspark.sql import functions as F, Window
df2 = df1.withColumn(
'order',
F.row_number().over(Window.partitionBy('user_id').orderBy('var1'))
)
df2.show()
+-------+----+-----+
|user_id|var1|order|
+-------+----+-----+
| u1| 10| 1|
| u1| 20| 2|
| u2| 10| 1|
| u2| 10| 2|
| u2| 30| 3|
+-------+----+-----+