PySpark window function - rows within n months of the current row



When a row's target equals 1, I want to remove all rows within x months of that row (before and after, based on the date column). Given this PySpark df:

+---+----------+------+
| id|      date|target|
+---+----------+------+
|  a|2020-01-01|     0|
|  a|2020-02-01|     0|
|  a|2020-03-01|     0|
|  a|2020-04-01|     1|
|  a|2020-05-01|     0|
|  a|2020-06-01|     0|
|  a|2020-07-01|     0|
|  a|2020-08-01|     0|
|  a|2020-09-01|     0|
|  a|2020-10-01|     1|
|  a|2020-11-01|     0|
|  b|2020-01-01|     0|
|  b|2020-02-01|     0|
|  b|2020-03-01|     0|
|  b|2020-05-01|     1|
+---+----------+------+

In your case, rangeBetween can be very useful. It operates on the ordering value and only includes rows whose value falls within the given range: for example, rangeBetween(-2, 2) takes all rows whose value lies between 2 below and 2 above the current row's value. Since rangeBetween does not work with dates (or strings), I used months_between to convert them to numbers.

from pyspark.sql import functions as F, Window

df = spark.createDataFrame(
    [('a', '2020-01-01', 0),
     ('a', '2020-02-01', 0),
     ('a', '2020-03-01', 0),
     ('a', '2020-04-01', 1),
     ('a', '2020-05-01', 0),
     ('a', '2020-06-01', 0),
     ('a', '2020-07-01', 0),
     ('a', '2020-08-01', 0),
     ('a', '2020-09-01', 0),
     ('a', '2020-10-01', 1),
     ('a', '2020-11-01', 0),
     ('b', '2020-01-01', 0),
     ('b', '2020-02-01', 0),
     ('b', '2020-03-01', 0),
     ('b', '2020-05-01', 1)],
    ['id', 'date', 'target']
)

window = 2  # number of months before and after the current row
windowSpec = (
    Window.partitionBy('id')
    .orderBy(F.months_between('date', F.lit('1970-01-01')))
    .rangeBetween(-window, window)
)
# Sum of target within the window, minus the current row's own target:
# nonzero means some *other* row within +/- `window` months has target = 1
df = df.withColumn('to_remove', F.sum('target').over(windowSpec) - F.col('target'))
df = df.where(F.col('to_remove') == 0).drop('to_remove')
df.show()
# +---+----------+------+
# | id|      date|target|
# +---+----------+------+
# |  a|2020-01-01|     0|
# |  a|2020-04-01|     1|
# |  a|2020-07-01|     0|
# |  a|2020-10-01|     1|
# |  b|2020-01-01|     0|
# |  b|2020-02-01|     0|
# |  b|2020-05-01|     1|
# +---+----------+------+
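The orderBy key is what makes rangeBetween work here: months_between('date', lit('1970-01-01')) yields a whole month count for first-of-month dates like these, so a range of (-2, 2) means "within 2 calendar months". A pure-Python sketch of that mapping (the month_index helper is hypothetical, for illustration only):

```python
from datetime import date

def month_index(d: date, epoch: date = date(1970, 1, 1)) -> int:
    # For first-of-month dates, Spark's months_between(d, epoch)
    # returns exactly this whole number of months
    return (d.year - epoch.year) * 12 + (d.month - epoch.month)

print(month_index(date(2020, 4, 1)))  # 603
print(month_index(date(2020, 2, 1)))  # 601 -> within rangeBetween(-2, 2) of 603
```

For dates that are not on the same day of the month, months_between returns a fraction, so the effective window is no longer an exact calendar-month boundary.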
