Applying a filter on the last function in PySpark

I have a dataframe with the following values:

id, time,          boolean
1,  4/12/21 11:36, 0
1,  4/13/21 11:36, 0
1,  4/14/21 11:36, 1
1,  4/15/21 11:36, 1
1,  4/16/21 11:36, 0
1,  4/17/21 11:36, 0
1,  4/18/21 11:36, 1
1,  4/19/21 11:36, 0

The output I want is:

id, time,          boolean, new_time
1,  4/12/21 11:36, 0,       null
1,  4/13/21 11:36, 0,       null
1,  4/14/21 11:36, 1,       4/14/21 11:36
1,  4/15/21 11:36, 1,       4/15/21 11:36
1,  4/16/21 11:36, 0,       4/15/21 11:36
1,  4/17/21 11:36, 0,       4/15/21 11:36
1,  4/18/21 11:36, 1,       4/18/21 11:36
1,  4/19/21 11:36, 0,       4/18/21 11:36

The output has a new column whose value is filled as follows: if boolean is 1, new_time is the current row's time; if boolean is 0, new_time is the time of the most recent row where boolean was 1.
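For reference, here is a minimal sketch that reproduces the sample data above (the column names and string-typed time values are taken from the listing; the SparkSession setup is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the listing above; 'time' is kept as a plain string.
df = spark.createDataFrame(
    [
        (1, '4/12/21 11:36', 0),
        (1, '4/13/21 11:36', 0),
        (1, '4/14/21 11:36', 1),
        (1, '4/15/21 11:36', 1),
        (1, '4/16/21 11:36', 0),
        (1, '4/17/21 11:36', 0),
        (1, '4/18/21 11:36', 1),
        (1, '4/19/21 11:36', 0),
    ],
    ['id', 'time', 'boolean'],
)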

I have gotten to the stage where I can filter the records, but I am not sure how to add a filter for the last condition.

from pyspark.sql.functions import last, lit, when
from pyspark.sql.window import Window

w = Window.partitionBy('id').orderBy('time')
df = df.withColumn("new_time", when(df.boolean == 0, last("time", True).over(w)).otherwise(lit(None)))

You can put the when inside last: a when without an otherwise returns null for rows where boolean is not 1, and last with ignorenulls set to True skips those nulls, so each row picks up the time of the most recent row where boolean was 1. (In your attempt, last("time", True) just returns the current row's time, since time is never null.)

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'new_time',
    # when() without otherwise() yields null where boolean != 1;
    # last(..., True) ignores those nulls, so the latest time with
    # boolean == 1 is carried forward to the following rows.
    F.last(
        F.when(F.col('boolean') == 1, F.col('time')),
        True
    ).over(Window.partitionBy('id').orderBy('time'))
)
df2.show()
+---+-------------+-------+-------------+
| id|         time|boolean|     new_time|
+---+-------------+-------+-------------+
|  1|4/12/21 11:36|      0|         null|
|  1|4/13/21 11:36|      0|         null|
|  1|4/14/21 11:36|      1|4/14/21 11:36|
|  1|4/15/21 11:36|      1|4/15/21 11:36|
|  1|4/16/21 11:36|      0|4/15/21 11:36|
|  1|4/17/21 11:36|      0|4/15/21 11:36|
|  1|4/18/21 11:36|      1|4/18/21 11:36|
|  1|4/19/21 11:36|      0|4/18/21 11:36|
+---+-------------+-------+-------------+
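Note that time here is a string, so orderBy('time') sorts lexicographically. That happens to be correct for this sample because every row falls in the same month, but it would break across months (e.g. '10/1/21' sorts before '4/12/21'). A safer variant parses the column first; a sketch, assuming the format is M/d/yy H:mm (inferred from the sample values):

from pyspark.sql import functions as F, Window

# Parse the string so the window orders chronologically, not lexicographically
# (the format string is an assumption based on values like '4/12/21 11:36').
df_ts = df.withColumn('ts', F.to_timestamp('time', 'M/d/yy H:mm'))

df2 = df_ts.withColumn(
    'new_time',
    F.last(F.when(F.col('boolean') == 1, F.col('time')), True)
     .over(Window.partitionBy('id').orderBy('ts'))
).drop('ts')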
