I have a dataframe with the following structure (the actual dataframe is much larger):
ID Side Start Time End Time Last Action
0 B Bid 2020-10-13 14:40:32 2020-10-23 17:00:05 Cancelled
1 A Bid 2020-10-19 11:44:02 2020-10-23 17:00:05 Cancelled
2 A Bid 2020-10-20 09:43:15 2020-10-23 17:00:05 Cancelled
3 B Bid 2020-10-20 16:56:03 2020-10-23 17:00:05 Cancelled
4 B Bid 2020-10-26 14:32:26 2020-10-28 10:18:31 Dealt
5 A Bid 2020-10-26 16:10:48 2020-10-29 16:49:27 Dealt
6 A Bid 2020-10-26 16:59:00 2020-10-27 16:55:35 Cancelled
7 A Ask 2020-10-27 09:39:58 2020-10-28 13:33:18 Updated
8 A Bid 2020-10-27 13:33:57 2020-10-27 13:42:18 Dealt
9 A Bid 2020-10-27 13:47:19 2020-10-27 16:40:30 Dealt
10 A Bid 2020-10-27 16:59:09 2020-10-28 10:19:51 Cancelled
11 B Ask 2020-10-28 09:02:33 2020-10-28 11:12:58 Cancelled
12 B Bid 2020-10-28 10:24:02 2020-10-28 14:44:44 Dealt
13 B Ask 2020-10-28 11:04:57 2020-10-28 11:07:12 Cancelled
14 B Ask 2020-10-28 11:09:58 2020-10-28 11:41:33 Cancelled
15 B Ask 2020-10-28 12:04:10 2020-10-28 12:10:52 Cancelled
16 B Ask 2020-10-28 13:24:10 2020-10-28 13:46:15 Cancelled
17 B Ask 2020-10-28 13:46:36 2020-10-28 13:48:18 Cancelled
18 B Ask 2020-10-28 14:49:05 2020-10-28 16:13:09 Cancelled
19 B Ask 2020-10-28 16:14:54 2020-10-28 16:14:59 Dealt
20 B Bid 2020-10-29 10:09:27 2020-10-29 10:39:08 Dealt
21 B Ask 2020-10-29 10:49:02 2020-10-29 10:49:58 Cancelled
22 B Ask 2020-10-29 13:13:55 2020-10-29 17:00:00 Updated
23 B Bid 2020-10-29 15:25:48 2020-10-29 15:31:09 Dealt
24 B Bid 2020-10-29 16:59:22 2020-10-29 17:00:20 Cancelled
Using this dataframe, I want to group by the ID column and flag instances of two rows that satisfy all of the following conditions:
- the Start Time / End Time intervals overlap
- the sides are opposite (Bid/Ask or Ask/Bid)
- the Last Action of one row is Dealt and that of the other is Cancelled (regardless of which one is the Bid/Ask)
I've managed to compute the overlap of the time frames, but I can't seem to add the three conditions listed above.
Thanks in advance!
Your requirements make this a fairly involved operation. Let me first walk you through a scaled-down example:
import pandas as pd

# Let's consider only rows whose ID == 'B'
group = df[df['ID'] == 'B']
# It is a lot easier to work with a position-based index in numpy vs. the
# label-based index in pandas (and faster too). We first need to convert the
# relevant columns to a 2D numpy array.
tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
# Remembering the position of all the columns is hard. Let's assign the
# positions to constants to aid readability
START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
# Check for time range overlap using numpy broadcasting
st = tmp[:, START_TIME]
et = tmp[:, END_TIME]
is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
# Extract the indices of the overlapping pairs. For example:
#   idx1 = [4, 12]
#   idx2 = [11, 17]
# => row #4 in `tmp` overlaps with row #11 in `tmp`
#    row #12 in `tmp` overlaps with row #17 in `tmp`
idx1, idx2 = is_overlap.nonzero()
# Now moving on to your other conditions
mask = (
    # If row i overlaps row j then row j also overlaps row i.
    # We want to avoid the double-counting
    (idx1 < idx2) &
    # I assume there are only 2 possible sides: Bid or Ask
    (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
    (
        # But there are more than 2 actions. We must list all the combinations
        # manually
        (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
        (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
    )
)
# Extract the indices of the rows that meet all conditions
i, = mask.nonzero()
# Now pair them up for the final result
pd.concat([
    group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
    group.iloc[idx2[i]].reset_index().add_suffix(' 2')
], axis=1)
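The broadcasted overlap check is the least obvious step, so here is a minimal, self-contained sketch of the same expression on plain integer intervals (toy values, not taken from the data above):

```python
import numpy as np

# Toy intervals (start, end): [1, 3], [2, 5], [6, 8]
st = np.array([1, 2, 6])
et = np.array([3, 5, 8])

# is_overlap[i, j] is True when interval j starts no later than interval i
# ends AND interval i starts no later than interval j ends, i.e. the two
# ranges share at least one point.
is_overlap = (st <= et[:, None]) & (st[:, None] <= et)

# The diagonal is always True (every interval overlaps itself) and the matrix
# is symmetric; the idx1 < idx2 filter used later removes both artifacts.
idx1, idx2 = is_overlap.nonzero()
```

Here only [1, 3] and [2, 5] overlap, so after the `idx1 < idx2` filter a single pair (0, 1) remains.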
Solution
Put all of the code above into a function, then call it from groupby('ID').apply:
def get_pairs(group):
    tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
    START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
    # Check for time range overlap using numpy broadcasting
    st = tmp[:, START_TIME]
    et = tmp[:, END_TIME]
    is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
    idx1, idx2 = is_overlap.nonzero()
    mask = (
        (idx1 < idx2) &
        (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
        (
            (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
            (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
        )
    )
    # Extract the indices of the rows that meet all conditions
    i, = mask.nonzero()
    # Now pair them up for the final result
    return pd.concat([
        group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
        group.iloc[idx2[i]].reset_index().add_suffix(' 2')
    ], axis=1)

df.groupby('ID').apply(get_pairs).droplevel(1)
The complexity is O(n^2), where n is the number of rows in the dataframe. The most expensive operation is the overlap check, since every row must be compared against every other row.
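For completeness, here is how the whole solution can be exercised end to end on a tiny made-up dataframe (three hypothetical orders under a single ID; the values are illustrative only, not taken from your data):

```python
import numpy as np
import pandas as pd

def get_pairs(group):
    tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
    START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
    st = tmp[:, START_TIME]
    et = tmp[:, END_TIME]
    is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
    idx1, idx2 = is_overlap.nonzero()
    mask = (
        (idx1 < idx2) &
        (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
        (
            (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
            (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
        )
    )
    i, = mask.nonzero()
    return pd.concat([
        group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
        group.iloc[idx2[i]].reset_index().add_suffix(' 2')
    ], axis=1)

# Three toy orders for one ID: the first two overlap in time, have opposite
# sides, and are Dealt/Cancelled -> exactly one flagged pair. The third does
# not overlap with anything.
toy = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'Side': ['Bid', 'Ask', 'Bid'],
    'Start Time': pd.to_datetime(['2020-10-01 09:00', '2020-10-01 09:30',
                                  '2020-10-02 09:00']),
    'End Time': pd.to_datetime(['2020-10-01 10:00', '2020-10-01 11:00',
                                '2020-10-02 10:00']),
    'Last Action': ['Dealt', 'Cancelled', 'Cancelled'],
})
pairs = toy.groupby('ID').apply(get_pairs).droplevel(1)
```

The result has one row per flagged pair, with the two matched orders side by side in the `' 1'` / `' 2'` suffixed columns.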