I'm running into a problem while writing logic in PySpark. I want to add a column to an existing DataFrame that counts, for each user_id, how many rows fall on the same date.
Example DataFrame:
user_id | timestamp |
---|---|
1 | 2021-01-01 9:00:00 |
1 | 2021-01-01 10:20:00 |
1 | 2021-01-01 18:00:00 |
2 | 2021-01-01 9:00:00 |
2 | 2021-01-02 9:00:00 |
1 | 2021-01-02 10:00:00 |
2 | 2021-01-02 9:30:00 |
1 | 2021-01-03 9:00:00 |
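If you want to reproduce this locally, a minimal sketch for building the sample DataFrame could look like the following (it assumes an active SparkSession named spark; the timestamps are kept as strings, which to_date can still parse):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
data = [
    (1, '2021-01-01 9:00:00'),
    (1, '2021-01-01 10:20:00'),
    (1, '2021-01-01 18:00:00'),
    (2, '2021-01-01 9:00:00'),
    (2, '2021-01-02 9:00:00'),
    (1, '2021-01-02 10:00:00'),
    (2, '2021-01-02 9:30:00'),
    (1, '2021-01-03 9:00:00'),
]
df = spark.createDataFrame(data, ['user_id', 'timestamp'])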
You can use a window function to do the counting:
from pyspark.sql.window import Window
import pyspark.sql.functions as f

# Count rows per user_id and calendar date; the window keeps every original row.
df.withColumn('perday', f.count('*').over(Window.partitionBy(df.user_id, f.to_date(df.timestamp)))).show()
+-------+-------------------+------+
|user_id| timestamp|perday|
+-------+-------------------+------+
| 2| 2021-01-02 9:00:00| 2|
| 2| 2021-01-02 9:30:00| 2|
| 1|2021-01-02 10:00:00| 1|
| 1| 2021-01-01 9:00:00| 3|
| 1|2021-01-01 10:20:00| 3|
| 1|2021-01-01 18:00:00| 3|
| 2| 2021-01-01 9:00:00| 1|
| 1| 2021-01-03 9:00:00| 1|
+-------+-------------------+------+
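If you prefer to compute the counts with an aggregation and attach them back to the original rows, a groupBy plus join gives the same result. This is only a sketch of that alternative, not part of the answer above; it reuses the functions module imported as f:

# Aggregate counts per user_id and date, then join them back onto the original rows
daily = (df.withColumn('day', f.to_date(df.timestamp))
           .groupBy('user_id', 'day')
           .agg(f.count('*').alias('perday')))

(df.withColumn('day', f.to_date(df.timestamp))
   .join(daily, on=['user_id', 'day'])
   .drop('day')
   .show())

The window version is usually the simpler choice here, since it adds the column in a single pass without an explicit join.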