确定datetime index是否在日期范围列表中



我有以下代码数据…

import pandas as pd, numpy as np
from datetime import datetime
end_dt = datetime.today()
st_dt = (end_dt + pd.DateOffset(-10)).date()
df_index = pd.date_range(st_dt, end_dt)
df = pd.DataFrame(index=df_index, columns=['in_range'])
data = [pd.to_datetime(['2022-11-08','2022-11-10']), pd.to_datetime(['2022-11-13','2022-11-15'])]
dt_ranges = pd.DataFrame(data,columns={'st_dt':'datetimens[64]', 'end_dt': 'datetimens[64]'})

df: 生成以下两个数据帧:
in_range
2022-11-08  NaN
2022-11-09  NaN
2022-11-10  NaN
2022-11-11  NaN
2022-11-12  NaN
2022-11-13  NaN
2022-11-14  NaN
2022-11-15  NaN
2022-11-16  NaN
2022-11-17  NaN
2022-11-18  NaN

和date_ranges:

st_dt       end_dt
0   2022-11-08  2022-11-10
1   2022-11-13  2022-11-15

我想更新'in_range'列,以指示索引是否落在第二个数据帧的任何一对开始和结束日期内。所以我应该这样结束:

in_range
2022-11-08  True
2022-11-09  True
2022-11-10  True
2022-11-11  NaN
2022-11-12  NaN
2022-11-13  True
2022-11-14  True
2022-11-15  True
2022-11-16  NaN
2022-11-17  NaN
2022-11-18  NaN

我已经走下尝试使用lambda和迭代来做到这一点的路径。但对我来说,这似乎是有效的。

def in_range(index_date, date_ranges):
for r in date_ranges.values:
if (r[0] >= index_date) & (r[1] <= index_date):
return True
return False
df['in_range'] = df.reset_index().apply(lambda x: in_range(x.date, dt_ranges), axis=1)

上面的总是将in_range设置为nan,尽管代码返回正确的值。我怀疑这是因为我重置了索引,所以它不能对齐。另外,正如前面提到的,这个解决方案可能非常低效

是否有更python/流行病的方法来做到这一点?

使用merge_asof和布尔索引:

s = df.index.to_series()
m = (pd.merge_asof(s.rename('st_dt'), dt_ranges)
['end_dt'].ge(s.to_numpy()).to_numpy()
)
df.loc[m, 'in_range'] = True

NB。dt_ranges中的间隔应不重叠

输出:

in_range
2022-11-08     True
2022-11-09     True
2022-11-10     True
2022-11-11      NaN
2022-11-12      NaN
2022-11-13     True
2022-11-14     True
2022-11-15     True
2022-11-16      NaN
2022-11-17      NaN
2022-11-18      NaN

一个选项是使用conditional_join计算非对等连接,它可以处理重叠:

# pip install pyjanitor
import pandas as pd
import janitor
(
df
.reset_index()
.conditional_join(
dt_ranges, 
('index', 'st_dt', '>='), 
('index', 'end_dt', '<='), 
# depending on your data size
# setting use_numba to True
# can improve performance
# of course, this requires numba installed
use_numba = False,
how = 'left', 
# performance is better when
# sort_by_appearance is False
sort_by_appearance=True)
.assign(in_range = lambda df: df.in_range.mask(df.st_dt.notna(), True))
.iloc[:, :2]
.set_index('index')
)
in_range
index              
2022-11-08     True
2022-11-09     True
2022-11-10     True
2022-11-11      NaN
2022-11-12      NaN
2022-11-13     True
2022-11-14     True
2022-11-15     True
2022-11-16      NaN
2022-11-17      NaN
2022-11-18      NaN

最新更新