df: 生成以下两个数据帧:
我有以下代码数据…
import pandas as pd, numpy as np
from datetime import datetime
end_dt = datetime.today()
st_dt = (end_dt + pd.DateOffset(-10)).date()
df_index = pd.date_range(st_dt, end_dt)
df = pd.DataFrame(index=df_index, columns=['in_range'])
data = [pd.to_datetime(['2022-11-08','2022-11-10']), pd.to_datetime(['2022-11-13','2022-11-15'])]
dt_ranges = pd.DataFrame(data,columns={'st_dt':'datetimens[64]', 'end_dt': 'datetimens[64]'})
df: 生成以下两个数据帧:
in_range
2022-11-08 NaN
2022-11-09 NaN
2022-11-10 NaN
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 NaN
2022-11-14 NaN
2022-11-15 NaN
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN
和date_ranges:
st_dt end_dt
0 2022-11-08 2022-11-10
1 2022-11-13 2022-11-15
我想更新'in_range'列,以指示索引是否落在第二个数据帧的任何一对开始和结束日期内。所以我应该这样结束:
in_range
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN
我已经走下尝试使用lambda和迭代来做到这一点的路径。但对我来说,这似乎是有效的。
def in_range(index_date, date_ranges):
for r in date_ranges.values:
if (r[0] >= index_date) & (r[1] <= index_date):
return True
return False
df['in_range'] = df.reset_index().apply(lambda x: in_range(x.date, dt_ranges), axis=1)
上面的总是将in_range设置为nan,尽管代码返回正确的值。我怀疑这是因为我重置了索引,所以它不能对齐。另外,正如前面提到的,这个解决方案可能非常低效
是否有更python/流行病的方法来做到这一点?
使用merge_asof
和布尔索引:
s = df.index.to_series()
m = (pd.merge_asof(s.rename('st_dt'), dt_ranges)
['end_dt'].ge(s.to_numpy()).to_numpy()
)
df.loc[m, 'in_range'] = True
NB。dt_ranges
中的间隔应不重叠
输出:
in_range
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN
一个选项是使用conditional_join计算非对等连接,它可以处理重叠:
# pip install pyjanitor
import pandas as pd
import janitor
(
df
.reset_index()
.conditional_join(
dt_ranges,
('index', 'st_dt', '>='),
('index', 'end_dt', '<='),
# depending on your data size
# setting use_numba to True
# can improve performance
# of course, this requires numba installed
use_numba = False,
how = 'left',
# performance is better when
# sort_by_appearance is False
sort_by_appearance=True)
.assign(in_range = lambda df: df.in_range.mask(df.st_dt.notna(), True))
.iloc[:, :2]
.set_index('index')
)
in_range
index
2022-11-08 True
2022-11-09 True
2022-11-10 True
2022-11-11 NaN
2022-11-12 NaN
2022-11-13 True
2022-11-14 True
2022-11-15 True
2022-11-16 NaN
2022-11-17 NaN
2022-11-18 NaN