I have a DataFrame with the following values:
id, time, boolean
1 ,4/12/21 11:36, 0
1 ,4/13/21 11:36, 0
1 ,4/14/21 11:36, 1
1 ,4/15/21 11:36, 1
1 ,4/16/21 11:36, 0
1 ,4/17/21 11:36, 0
1 ,4/18/21 11:36, 1
1 ,4/19/21 11:36, 0
The output I want is:
id, time, boolean ,new_time
1 ,4/12/21 11:36, 0, null
1 ,4/13/21 11:36, 0, null
1 ,4/14/21 11:36, 1, 4/14/21 11:36
1 ,4/15/21 11:36, 1, 4/15/21 11:36
1 ,4/16/21 11:36, 0, 4/15/21 11:36
1 ,4/17/21 11:36, 0, 4/15/21 11:36
1 ,4/18/21 11:36, 1, 4/18/21 11:36
1 ,4/19/21 11:36, 0, 4/18/21 11:36
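The output includes a new column, new_time, whose value follows this condition: a) if boolean is 0, it is populated with the time of the most recent row where boolean was 1. (As the expected output shows, rows where boolean is 1 carry their own time, and rows with no earlier 1 stay null.)
I have got as far as filtering the records, but I am not sure how to add the filter for that last condition.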
w = Window.partitionBy('id').orderBy('time')
df=df.withColumn("new_time",when(df.boolean == 0,last("time",True).over(w)).otherwise(lit(None)))
You can put the when inside the last:
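from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'new_time',
    F.last(
        # time where boolean == 1, null otherwise
        F.when(F.col('boolean') == 1, F.col('time')),
        ignorenulls=True  # skip the nulls so the last non-null time carries forward
    ).over(Window.partitionBy('id').orderBy('time'))
)

df2.show()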
+---+-------------+-------+-------------+
| id|         time|boolean|     new_time|
+---+-------------+-------+-------------+
|  1|4/12/21 11:36|      0|         null|
|  1|4/13/21 11:36|      0|         null|
|  1|4/14/21 11:36|      1|4/14/21 11:36|
|  1|4/15/21 11:36|      1|4/15/21 11:36|
|  1|4/16/21 11:36|      0|4/15/21 11:36|
|  1|4/17/21 11:36|      0|4/15/21 11:36|
|  1|4/18/21 11:36|      1|4/18/21 11:36|
|  1|4/19/21 11:36|      0|4/18/21 11:36|
+---+-------------+-------+-------------+
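
To make the logic of that window expression easier to follow, here is a rough plain-Python sketch of the same idea, with no Spark involved. It assumes the rows are already sorted by time within each id, and the function name fill_last_true_time is made up for this illustration. The when step corresponds to only recording a time when boolean is 1, and last with nulls ignored corresponds to carrying the most recently recorded time forward.

```python
def fill_last_true_time(rows):
    """Hypothetical helper: rows are (id, time, boolean) tuples,
    assumed to be sorted by time within each id already."""
    last_seen = {}  # id -> most recent time at which boolean was 1
    result = []
    for row_id, time, flag in rows:
        if flag == 1:
            # Equivalent of the when(...): only these rows contribute a value
            last_seen[row_id] = time
        # Equivalent of last(..., ignorenulls=True) over the rows seen so far;
        # None plays the role of null when no boolean == 1 row has occurred yet
        result.append((row_id, time, flag, last_seen.get(row_id)))
    return result


rows = [
    (1, "4/12/21 11:36", 0),
    (1, "4/13/21 11:36", 0),
    (1, "4/14/21 11:36", 1),
    (1, "4/15/21 11:36", 1),
    (1, "4/16/21 11:36", 0),
    (1, "4/17/21 11:36", 0),
    (1, "4/18/21 11:36", 1),
    (1, "4/19/21 11:36", 0),
]

for row in fill_last_true_time(rows):
    print(row)
```

This is only meant to show why wrapping the when inside the last gives the desired column; on real data the window version above is the one to use, since Spark handles the partitioning and ordering.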