I have two dataframes: df_intervals, which contains time spans between two timestamps, and df_events, which contains single timestamps representing events.
df_events, df_intervals
(  Var2                  ts
 0  bar 2021-02-10 09:04:31
 1  bar 2021-01-29 05:56:17
 2  bar 2021-01-16 15:59:43
 3  bar 2021-01-25 09:40:40
 4  bar 2021-01-27 16:44:57
 5  bar 2021-01-17 13:28:43
 6  bar 2021-02-03 11:46:10
 7  bar 2021-02-02 11:16:49
 8  bar 2021-01-21 17:12:15
 9  bar 2021-01-19 03:44:30,
   Var1            start_ts              end_ts
 0  foo 2021-02-01 20:29:57 2021-02-02 20:29:57
 1  foo 2021-02-03 20:29:57 2021-02-04 20:29:57
 2  foo 2021-02-04 20:29:57 2021-02-05 20:29:57
 3  foo 2021-02-05 20:29:57 2021-02-06 20:29:57
 4  foo 2021-02-06 20:29:57 2021-02-07 20:29:57
 5  foo 2021-02-07 20:29:57 2021-02-08 20:29:57
 6  foo 2021-02-08 20:29:57 2021-02-11 20:29:57
 7  foo 2021-02-08 20:29:57 2021-02-10 20:29:57
 8  foo 2021-02-10 20:29:57 2021-02-11 20:29:57)
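My goal is to left join df_intervals onto df_events wherever df_intervals has a matching record, that is, wherever the event timestamp ts falls between start_ts and end_ts. The resulting join should produce a boolean match column and a count column containing the number of matching intervals. The problem I'm running into is the case where df_intervals has multiple matches: match should be True if there is at least one match. I've tried pd.merge_asof(), but I'm confused about how to use the direction= parameter with this kind of "between" query. Any suggestions on how to perform this kind of complex join using pandas or numpy (without sqlite)?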
df_output
Var2 ts match count
bar 1612947871 True 2
bar 1611899777 False 0
bar 1610812783 False 0
bar 1611567640 False 0
bar 1611765897 False 0
bar 1610890123 False 0
bar 1612352770 False 0
bar 1612264609 True 1
bar 1611249135 False 0
bar 1611027870 False 0
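Note: I have been able to do this matching in a for loop, but it requires collecting the output lists into a matrix and then checking for True/False values. I'm looking for suggestions for a simpler solution.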
for event in df_events['ts']:
    matches = []
    for idx, a, b, c in df_intervals.itertuples():
        print(idx, a, b, c)
        if b <= event <= c:
            matches.append(True)
        else:
            matches.append(False)
    print(matches)
Code to generate the dataframes:
import pandas as pd
import random
intervals = [['foo', 1612211397, 1612297797],
['foo', 1612384197, 1612470597], ['foo', 1612470597, 1612556997],
['foo', 1612556997 , 1612643397], ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197],
['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997], ['foo', 1612988997, 1613075397]]
df_intervals = pd.DataFrame(intervals, columns = ['Var1', 'start_ts', 'end_ts'])
df_intervals['start_ts']= pd.to_datetime(df_intervals['start_ts'],unit='s')
df_intervals['end_ts']= pd.to_datetime(df_intervals['end_ts'],unit='s')
ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897, 1610890123, 1612352770, 1612264609,1611249135,
1611027870]
bar = ['bar'] * 10
d = {'Var2': bar,'ts':ts}
df_events = pd.DataFrame(d)
df_events['ts']= pd.to_datetime(df_events['ts'],unit='s')
This gives the same output as the example:
df_events['count'] = df_events.apply(lambda r: len(df_intervals.loc[(df_intervals['start_ts']<=r['ts']) & (df_intervals['end_ts']>=r['ts'])]), axis=1)
df_events['match'] = df_events['count'] > 0
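Since the question is framed as a join, the same result can probably also be reached with an explicit cross join followed by a filter. The following is only a minimal sketch, assuming pandas 1.2 or later (needed for how='cross'); the names pairs and hit are illustrative and not part of the original post. Note that the intermediate frame has len(df_events) * len(df_intervals) rows, so this is only reasonable for moderately sized inputs.

```python
import pandas as pd

# Rebuild the sample data from the question (epoch seconds).
intervals = [['foo', 1612211397, 1612297797], ['foo', 1612384197, 1612470597],
             ['foo', 1612470597, 1612556997], ['foo', 1612556997, 1612643397],
             ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197],
             ['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997],
             ['foo', 1612988997, 1613075397]]
df_intervals = pd.DataFrame(intervals, columns=['Var1', 'start_ts', 'end_ts'])
df_intervals['start_ts'] = pd.to_datetime(df_intervals['start_ts'], unit='s')
df_intervals['end_ts'] = pd.to_datetime(df_intervals['end_ts'], unit='s')

ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897,
      1610890123, 1612352770, 1612264609, 1611249135, 1611027870]
df_events = pd.DataFrame({'Var2': ['bar'] * len(ts), 'ts': ts})
df_events['ts'] = pd.to_datetime(df_events['ts'], unit='s')

# Pair every event with every interval; reset_index() keeps the original
# event row label in a column named 'index' so results can be mapped back.
pairs = df_events.reset_index().merge(df_intervals, how='cross')

# True where the event timestamp lies inside the interval (bounds inclusive).
hit = pairs['ts'].between(pairs['start_ts'], pairs['end_ts'])

# Count hits per original event row; assignment aligns on the index.
df_events['count'] = hit.groupby(pairs['index']).sum()
df_events['match'] = df_events['count'] > 0
print(df_events)
```

Events with no containing interval still appear in the output with count 0 and match False, which is the left-join behaviour the question asks for.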
You can use .apply() to get the count column (match is True if count != 0):
df_events['count'] = df_events.apply(lambda x: sum( (x['ts']>=df_intervals['start_ts']) & (x['ts']<=df_intervals['end_ts'])), axis=1)
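Both .apply() variants run a Python-level function once per event row. If that becomes slow, one possible alternative is NumPy broadcasting, which compares all events against all intervals in a single step. This is a minimal sketch rather than a definitive solution; the names ev, start, end and inside are illustrative, and the boolean matrix needs len(df_events) * len(df_intervals) entries of memory.

```python
import pandas as pd

# Rebuild the sample data from the question (epoch seconds).
intervals = [['foo', 1612211397, 1612297797], ['foo', 1612384197, 1612470597],
             ['foo', 1612470597, 1612556997], ['foo', 1612556997, 1612643397],
             ['foo', 1612643397, 1612729797], ['foo', 1612729797, 1612816197],
             ['foo', 1612816197, 1613075397], ['foo', 1612816197, 1612988997],
             ['foo', 1612988997, 1613075397]]
df_intervals = pd.DataFrame(intervals, columns=['Var1', 'start_ts', 'end_ts'])
df_intervals['start_ts'] = pd.to_datetime(df_intervals['start_ts'], unit='s')
df_intervals['end_ts'] = pd.to_datetime(df_intervals['end_ts'], unit='s')

ts = [1612947871, 1611899777, 1610812783, 1611567640, 1611765897,
      1610890123, 1612352770, 1612264609, 1611249135, 1611027870]
df_events = pd.DataFrame({'Var2': ['bar'] * len(ts), 'ts': ts})
df_events['ts'] = pd.to_datetime(df_events['ts'], unit='s')

# Column vector of event timestamps, shape (n_events, 1).
ev = df_events['ts'].to_numpy()[:, None]
# Row vectors of interval bounds, shape (n_intervals,).
start = df_intervals['start_ts'].to_numpy()
end = df_intervals['end_ts'].to_numpy()

# Broadcasting yields a boolean matrix of shape (n_events, n_intervals):
# inside[i, j] is True when event i falls within interval j (inclusive).
inside = (start <= ev) & (ev <= end)

df_events['count'] = inside.sum(axis=1)   # matching intervals per event
df_events['match'] = inside.any(axis=1)   # at least one matching interval
print(df_events)
```

The matrix inside is essentially the "matrix of True/False lists" from the for loop in the question, built without explicit Python loops.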