Python: How do I remove dates that belong to a run of three or more consecutive days from a list of dates?



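I have a list of dates called dates:

I want to remove from this list every date that belongs to a run of three or more consecutive days. Those are the dates I have indented in the list.

What is the fastest way to do this?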

[datetime.date(2018, 7, 2),
 datetime.date(2018, 7, 5),
 datetime.date(2018, 7, 7),
     datetime.date(2018, 7, 15),
     datetime.date(2018, 7, 16),
     datetime.date(2018, 7, 17),
 datetime.date(2018, 7, 29),
 datetime.date(2018, 8, 13),
 datetime.date(2018, 8, 27),
 datetime.date(2018, 9, 19),
 datetime.date(2018, 10, 25),
 datetime.date(2018, 11, 9),
     datetime.date(2018, 12, 21),
     datetime.date(2018, 12, 22),
     datetime.date(2018, 12, 23),
     datetime.date(2018, 12, 24),
     datetime.date(2018, 12, 25),
     datetime.date(2019, 1, 2),
     datetime.date(2019, 1, 3),
     datetime.date(2019, 1, 4),
     datetime.date(2019, 1, 5),
     datetime.date(2019, 1, 6),
     datetime.date(2019, 1, 7),
     datetime.date(2019, 1, 8),
     datetime.date(2019, 2, 27),
     datetime.date(2019, 2, 28),
     datetime.date(2019, 3, 1),
     datetime.date(2019, 3, 2),
     datetime.date(2019, 3, 3),
 datetime.date(2019, 3, 6),
     datetime.date(2019, 3, 11),
     datetime.date(2019, 3, 12),
     datetime.date(2019, 3, 13),
     datetime.date(2019, 3, 14),
 datetime.date(2019, 3, 16),
 datetime.date(2019, 3, 25),
 datetime.date(2019, 3, 27),
 datetime.date(2019, 3, 29),
 datetime.date(2019, 3, 30),
 datetime.date(2019, 4, 8)]

So, after removing the dates that belong to a run of three or more consecutive days, the expected result is:

[datetime.date(2018, 7, 2),
datetime.date(2018, 7, 5),
datetime.date(2018, 7, 7),
datetime.date(2018, 7, 29),
datetime.date(2018, 8, 13),
datetime.date(2018, 8, 27),
datetime.date(2018, 9, 19),
datetime.date(2018, 10, 25),
datetime.date(2018, 11, 9),
datetime.date(2019, 3, 6),
datetime.date(2019, 3, 16),
datetime.date(2019, 3, 25),
datetime.date(2019, 3, 27),
datetime.date(2019, 3, 29),
datetime.date(2019, 3, 30),
datetime.date(2019, 4, 8)]

My solution is as follows:

import datetime

dates = [datetime.date(2018, 7, 2),
         datetime.date(2018, 7, 5),
         ...,
         datetime.date(2019, 3, 30),
         datetime.date(2019, 4, 8)]

def are_consecutive(d1, d2):
    # Two dates are consecutive when they are exactly one day apart.
    return d2 - d1 == datetime.timedelta(1)

ordered = sorted(dates)    # sort once and compare neighbours in the sorted list
filtered_out = set()
consecutive = set()
for d1, d2 in zip(ordered, ordered[1:]):
    if are_consecutive(d1, d2):
        consecutive.add(d1)
        consecutive.add(d2)
    else:
        if len(consecutive) >= 3:
            filtered_out.update(consecutive)
        consecutive = set()
# Flush a run that reaches the end of the list.
if len(consecutive) >= 3:
    filtered_out.update(consecutive)
selected = [d for d in dates if d not in filtered_out]

selected is:

[datetime.date(2018, 7, 2),
datetime.date(2018, 7, 5),
datetime.date(2018, 7, 7),
datetime.date(2018, 7, 29),
datetime.date(2018, 8, 13),
datetime.date(2018, 8, 27),
datetime.date(2018, 9, 19),
datetime.date(2018, 10, 25),
datetime.date(2018, 11, 9),
datetime.date(2019, 3, 6),
datetime.date(2019, 3, 16),
datetime.date(2019, 3, 25),
datetime.date(2019, 3, 27),
datetime.date(2019, 3, 29),
datetime.date(2019, 3, 30),
datetime.date(2019, 4, 8)]

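If you were wondering whether 27 February, 28 February and 1 March 2019 count as consecutive: yes, they do (2019 is not a leap year).

A brief explanation of the code: are_consecutive() simply checks whether two dates are consecutive. If they are, their difference equals datetime.timedelta(1). I use this function to check each date against the next one. The dates are sorted at the start of the loop, just to be sure of their order. If two dates are consecutive, they are stored in the consecutive set; if not, I check how many consecutive dates have been stored so far. If there are three or more, they are saved in the filtered_out set; otherwise they are not. consecutive is reset every time two dates are not consecutive.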
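The same idea of collecting runs and discarding those of length three or more can also be sketched with itertools.groupby from the standard library. This is only an illustration, not the code above: the function name drop_long_runs and the min_run parameter are made up for this example, and it assumes duplicate dates can be collapsed.

```python
import datetime
from itertools import groupby


def drop_long_runs(dates, min_run=3):
    """Return the dates that are not part of a run of min_run or more consecutive days."""
    ordered = sorted(set(dates))  # assumption: duplicates may be collapsed
    kept = []
    # Inside a run of consecutive days, (ordinal - position) stays constant,
    # so it can be used as the grouping key.
    for _, group in groupby(enumerate(ordered),
                            key=lambda pair: pair[1].toordinal() - pair[0]):
        run = [d for _, d in group]
        if len(run) < min_run:
            kept.extend(run)
    return kept


sample = [datetime.date(2018, 7, 2),
          datetime.date(2018, 7, 15),
          datetime.date(2018, 7, 16),
          datetime.date(2018, 7, 17),
          datetime.date(2019, 3, 29),
          datetime.date(2019, 3, 30)]
print(drop_long_runs(sample))
```

On this small sample, only 2 July 2018 and the pair 29 to 30 March 2019 remain, because the three July dates form a run of three.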

My answer is as follows:

import datetime
import numpy as np

dates = [datetime.date(2018, 7, 2),
         datetime.date(2018, 7, 5),
         ......
         datetime.date(2019, 4, 8)]
dates = np.array(dates)
inds = np.ones_like(dates, dtype=bool)    # True marks a date to keep
i = 0
while i < len(dates) - 1:
    datei = dates[i]
    for j in range(i + 1, len(dates)):
        datej = dates[j]
        # Stop at the first date that breaks the run starting at index i.
        if datei + datetime.timedelta(j - i) != datej:
            break
    else:
        # No break: the run extends to the end of the list.
        j = len(dates)
    if j - i >= 3:
        inds[i:j] = False    # drop the whole run dates[i:j]
    i = j
dates = dates[inds]
print(dates)

Output:

[datetime.date(2018, 7, 2) datetime.date(2018, 7, 5)
datetime.date(2018, 7, 7) datetime.date(2018, 7, 29)
datetime.date(2018, 8, 13) datetime.date(2018, 8, 27)
datetime.date(2018, 9, 19) datetime.date(2018, 10, 25)
datetime.date(2018, 11, 9) datetime.date(2019, 3, 6)
datetime.date(2019, 3, 16) datetime.date(2019, 3, 25)
datetime.date(2019, 3, 27) datetime.date(2019, 3, 29)
datetime.date(2019, 3, 30) datetime.date(2019, 4, 8)]

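Unfortunately, the full output is too long to print here, so I thought I would leave the result in the comments. Feel free to try the code and tell me if I have forgotten an edge case ;)

The DataFrame must be sorted in ascending order.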

import pandas as pd

# Create the DataFrame, then drop some rows so that the remaining dates
# include isolated days and short runs.
df = pd.DataFrame({
    'date': pd.date_range(start='2018-01-01', end='2018-01-31')
})
df = df.loc[~df.index.isin([1, 3, 5, 10, 12, 15, 25])]

# First, label the runs of consecutive days.
# Take the difference between each date and the previous one, and build a
# boolean mask that is True where the difference is not exactly one day.
# The cumulative sum of that mask gives each run of consecutive dates its own label.
df['range_count'] = df['date'].diff().dt.days.ne(1).cumsum()

# Count the rows in each run; transform broadcasts the count back to every row.
df['check'] = df.groupby('range_count')['date'].transform('count')

# Selecting the dates in runs shorter than three days is then easy.
print(
    df.loc[df['check'] < 3, 'date']
)
# 0    2018-01-01
# 2    2018-01-03
# 4    2018-01-05
# 11   2018-01-12
# 13   2018-01-14
# 14   2018-01-15

Assuming dates is the list you provided, sorted in ascending order, the following code:

j = 0                        # index of the date checked for consecutives
while j < len(dates):
    date = dates[j]          # the date checked for consecutives
    i = 1                    # counter of consecutive days in the list
    j += 1
    while True:              # count consecutive days and delete when 3 or more found
        date = date + datetime.timedelta(days=1)  # check if the following day is in the list
        if date in dates:    # if found in the list then:
            i += 1           # count it and check for the next.
        else:                # if not in the list then:
            if i > 2:        # if 3 or more consecutive dates are found
                del dates[j-1:j+i-1]  # delete them from list
                j -= 1       # the next unchecked date has moved to index j-1, so step back
            break
print(dates)

produces the desired output:

[datetime.date(2018, 7, 2), datetime.date(2018, 7, 5), datetime.date(2018, 7, 7), datetime.date(2018, 7, 29), datetime.date(2018, 8, 13), datetime.date(2018, 8, 27), datetime.date(2018, 9, 19), datetime.date(2018, 10, 25), datetime.date(2018, 11, 9), datetime.date(2019, 3, 6), datetime.date(2019, 3, 16), datetime.date(2019, 3, 25), datetime.date(2019, 3, 27), datetime.date(2019, 3, 29), datetime.date(2019, 3, 30), datetime.date(2019, 4, 8)]
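The loop above tests membership with date in dates, which scans the list each time. As a rough sketch of an alternative, the membership test can be done against a set instead: a date belongs to a run of three or more exactly when it sits at the start, middle, or end of three consecutive days that are all present. The names in_run_of_three and remove_runs are hypothetical, and the sketch assumes no date is within two days of datetime.date.min or datetime.date.max.

```python
import datetime

ONE_DAY = datetime.timedelta(days=1)


def in_run_of_three(d, present):
    """True if d and two adjacent days form three consecutive dates in `present`."""
    before1 = (d - ONE_DAY) in present
    before2 = (d - 2 * ONE_DAY) in present
    after1 = (d + ONE_DAY) in present
    after2 = (d + 2 * ONE_DAY) in present
    # d is the last, the middle, or the first of three consecutive days.
    return (before1 and before2) or (before1 and after1) or (after1 and after2)


def remove_runs(dates):
    """Keep only the dates that are not in a run of three or more; order is preserved."""
    present = set(dates)
    return [d for d in dates if not in_run_of_three(d, present)]


sample = [datetime.date(2018, 7, 5),
          datetime.date(2018, 7, 7),
          datetime.date(2018, 12, 21),
          datetime.date(2018, 12, 22),
          datetime.date(2018, 12, 23),
          datetime.date(2018, 12, 24),
          datetime.date(2019, 3, 29),
          datetime.date(2019, 3, 30)]
print(remove_runs(sample))
```

This version does not modify the input list and does not require it to be sorted.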

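Inspired by this post. First find all the consecutive days, then group them into consecutive periods, and finally pick out the periods with three or more consecutive days.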

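import datetime
import numpy as np
import pandas as pd

s = pd.Series([
    datetime.date(2018, 7, 2),
    datetime.date(2018, 7, 5),
    datetime.date(2018, 7, 7),
    ...
])
s = pd.to_datetime(s)    # datetime64 values, so diff() and shift() arithmetic is well defined
# Define 1 day difference
day = pd.Timedelta('1d')
# Find all consecutive days (a mask of dates with a neighbour one day away;
# the final selection below does not use it)
consecutive_days = ((s - s.shift(-1)).abs() == day) | (s.diff() == day)
# Group into consecutive periods: a new label starts wherever the gap is not one day
consecutive_groups = (s.diff() != day).cumsum()
# Find groups with 3 or more consecutive days
unique, count = np.unique(consecutive_groups, return_counts=True)
s[~consecutive_groups.isin(unique[count >= 3])].dt.date.tolist()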

This returns the following.

[datetime.date(2018, 7, 2),
datetime.date(2018, 7, 5),
datetime.date(2018, 7, 7),
datetime.date(2018, 7, 29),
datetime.date(2018, 8, 13),
datetime.date(2018, 8, 27),
datetime.date(2018, 9, 19),
datetime.date(2018, 10, 25),
datetime.date(2018, 11, 9),
datetime.date(2019, 3, 6),
datetime.date(2019, 3, 16),
datetime.date(2019, 3, 25),
datetime.date(2019, 3, 27),
datetime.date(2019, 3, 29),
datetime.date(2019, 3, 30),
datetime.date(2019, 4, 8)]

My take on this:

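unconsecutive_dates = []
previous = None
for d in sorted(dates):
    # d directly follows the last kept date: that date starts a run, so drop it.
    # Note: as written, this also drops runs of exactly two consecutive dates.
    if unconsecutive_dates and d == unconsecutive_dates[-1] + datetime.timedelta(days=1):
        unconsecutive_dates.pop()
    # d does not follow the previous date, so keep it for now.
    elif previous != d - datetime.timedelta(days=1):
        unconsecutive_dates.append(d)
    previous = d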
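The loop above pops a kept date as soon as its successor appears, so a pair such as 29 and 30 March 2019 is removed as well. A minimal single-pass sketch that keeps pairs and only drops runs of three or more could buffer the current run and decide when the run ends. The function name keep_short_runs and the min_run parameter are hypothetical, and the sketch assumes the list contains no duplicate dates.

```python
import datetime


def keep_short_runs(dates, min_run=3):
    """Return the dates (sorted) that are not part of a run of min_run or more days."""
    kept, run = [], []
    for d in sorted(dates):
        if run and d - run[-1] == datetime.timedelta(days=1):
            run.append(d)            # d extends the current run
        else:
            if len(run) < min_run:   # the finished run is short enough to keep
                kept.extend(run)
            run = [d]                # start a new run at d
    if len(run) < min_run:           # handle the run that reaches the end of the list
        kept.extend(run)
    return kept


sample = [datetime.date(2019, 3, 25),
          datetime.date(2019, 3, 29),
          datetime.date(2019, 3, 30),
          datetime.date(2019, 4, 8),
          datetime.date(2019, 4, 9),
          datetime.date(2019, 4, 10)]
print(keep_short_runs(sample))
```

Here the pair 29 to 30 March survives, while the trailing run 8 to 10 April is removed.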
