I have a dataframe with the following structure (the actual dataframe is much larger):
ID Side Start Time End Time Last Action
0 B Bid 2020-10-13 14:40:32 2020-10-23 17:00:05 Cancelled
1 A Bid 2020-10-19 11:44:02 2020-10-23 17:00:05 Cancelled
2 A Bid 2020-10-20 09:43:15 2020-10-23 17:00:05 Cancelled
3 B Bid 2020-10-20 16:56:03 2020-10-23 17:00:05 Cancelled
4 B Bid 2020-10-26 14:32:26 2020-10-28 10:18:31 Dealt
5 A Bid 2020-10-26 16:10:48 2020-10-29 16:49:27 Dealt
6 A Bid 2020-10-26 16:59:00 2020-10-27 16:55:35 Cancelled
7 A Ask 2020-10-27 09:39:58 2020-10-28 13:33:18 Updated
8 A Bid 2020-10-27 13:33:57 2020-10-27 13:42:18 Dealt
9 A Bid 2020-10-27 13:47:19 2020-10-27 16:40:30 Dealt
10 A Bid 2020-10-27 16:59:09 2020-10-28 10:19:51 Cancelled
11 B Ask 2020-10-28 09:02:33 2020-10-28 11:12:58 Cancelled
12 B Bid 2020-10-28 10:24:02 2020-10-28 14:44:44 Dealt
13 B Ask 2020-10-28 11:04:57 2020-10-28 11:07:12 Cancelled
14 B Ask 2020-10-28 11:09:58 2020-10-28 11:41:33 Cancelled
15 B Ask 2020-10-28 12:04:10 2020-10-28 12:10:52 Cancelled
16 B Ask 2020-10-28 13:24:10 2020-10-28 13:46:15 Cancelled
17 B Ask 2020-10-28 13:46:36 2020-10-28 13:48:18 Cancelled
18 B Ask 2020-10-28 14:49:05 2020-10-28 16:13:09 Cancelled
19 B Ask 2020-10-28 16:14:54 2020-10-28 16:14:59 Dealt
20 B Bid 2020-10-29 10:09:27 2020-10-29 10:39:08 Dealt
21 B Ask 2020-10-29 10:49:02 2020-10-29 10:49:58 Cancelled
22 B Ask 2020-10-29 13:13:55 2020-10-29 17:00:00 Updated
23 B Bid 2020-10-29 15:25:48 2020-10-29 15:31:09 Dealt
24 B Bid 2020-10-29 16:59:22 2020-10-29 17:00:20 Cancelled
Using this dataframe, I want to group by the ID column and flag instances of two rows that satisfy all of the following conditions:
- the Start Time / End Time intervals overlap
- the sides are opposite (Bid/Ask or Ask/Bid)
- the Last Action of one row is Dealt and that of the other is Cancelled (regardless of which one is the Bid/Ask)
I've managed to compute the overlap of the time frames, but I can't seem to add the three conditions listed above.
Thanks in advance!
Your requirements make this a fairly involved operation. Let me first walk you through a scaled-down example:
import pandas as pd

# Let's consider only rows whose ID == 'B'
group = df[df['ID'] == 'B']
# It is a lot easier to work with a position-based index in numpy vs. the
# label-based index in pandas (and faster too). We first need to convert the
# relevant columns to a 2D numpy array.
tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
# Remembering the position of all the columns is hard. Let's assign the
# positions to constants to aid readability
START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
# Check for time range overlap using numpy broadcasting
st = tmp[:, START_TIME]
et = tmp[:, END_TIME]
is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
# Extract the indices of the overlapping pairs. For example:
#   idx1 = [4, 12]
#   idx2 = [11, 17]
# => row #4 in `tmp` overlaps with row #11 in `tmp`
#    row #12 in `tmp` overlaps with row #17 in `tmp`
idx1, idx2 = is_overlap.nonzero()
# Now moving on to your other conditions
mask = (
    # If row i overlaps row j then row j also overlaps row i.
    # We want to avoid the double-counting
    (idx1 < idx2) &
    # I assume there are only 2 possible sides: Bid or Ask
    (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
    (
        # But there are more than 2 actions. We must list all the combinations
        # manually
        (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
        (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
    )
)
# Extract the indices of the rows that meet all conditions
i, = mask.nonzero()
# Now pair them up for the final result
pd.concat([
    group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
    group.iloc[idx2[i]].reset_index().add_suffix(' 2')
], axis=1)
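The broadcasted overlap check is the least obvious step, so here is a minimal, self-contained sketch of the same expression on plain integer intervals (toy values, not taken from the data above):

```python
import numpy as np

# Toy intervals (start, end): [1, 3], [2, 5], [6, 8]
st = np.array([1, 2, 6])
et = np.array([3, 5, 8])

# is_overlap[i, j] is True when interval j starts no later than interval i
# ends AND interval i starts no later than interval j ends, i.e. the two
# ranges share at least one point.
is_overlap = (st <= et[:, None]) & (st[:, None] <= et)

# The diagonal is always True (every interval overlaps itself) and the matrix
# is symmetric; the idx1 < idx2 filter used later removes both artifacts.
idx1, idx2 = is_overlap.nonzero()
```

Here only [1, 3] and [2, 5] overlap, so after the `idx1 < idx2` filter a single pair (0, 1) remains.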
Solution
Put all of the code above into a function, then call it from groupby('ID').apply:
def get_pairs(group):
    tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
    START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
    # Check for time range overlap using numpy broadcasting
    st = tmp[:, START_TIME]
    et = tmp[:, END_TIME]
    is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
    idx1, idx2 = is_overlap.nonzero()
    mask = (
        (idx1 < idx2) &
        (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
        (
            (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
            (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
        )
    )
    # Extract the indices of the rows that meet all conditions
    i, = mask.nonzero()
    # Now pair them up for the final result
    return pd.concat([
        group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
        group.iloc[idx2[i]].reset_index().add_suffix(' 2')
    ], axis=1)

df.groupby('ID').apply(get_pairs).droplevel(1)
The complexity is O(n^2), where n is the number of rows in the dataframe. The most expensive operation is the overlap check, since every row must be compared against every other row.
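For completeness, here is how the whole solution can be exercised end to end on a tiny made-up dataframe (three hypothetical orders under a single ID; the values are illustrative only, not taken from your data):

```python
import numpy as np
import pandas as pd

def get_pairs(group):
    tmp = group[['Start Time', 'End Time', 'Side', 'Last Action']].to_numpy()
    START_TIME, END_TIME, SIDE, LAST_ACTION = range(4)
    st = tmp[:, START_TIME]
    et = tmp[:, END_TIME]
    is_overlap = (st <= et[:, None]) & (st[:, None] <= et)
    idx1, idx2 = is_overlap.nonzero()
    mask = (
        (idx1 < idx2) &
        (tmp[idx1, SIDE] != tmp[idx2, SIDE]) &
        (
            (tmp[idx1, LAST_ACTION] == 'Dealt') & (tmp[idx2, LAST_ACTION] == 'Cancelled') |
            (tmp[idx1, LAST_ACTION] == 'Cancelled') & (tmp[idx2, LAST_ACTION] == 'Dealt')
        )
    )
    i, = mask.nonzero()
    return pd.concat([
        group.iloc[idx1[i]].reset_index().add_suffix(' 1'),
        group.iloc[idx2[i]].reset_index().add_suffix(' 2')
    ], axis=1)

# Three toy orders for one ID: the first two overlap in time, have opposite
# sides, and are Dealt/Cancelled -> exactly one flagged pair. The third does
# not overlap with anything.
toy = pd.DataFrame({
    'ID': ['A', 'A', 'A'],
    'Side': ['Bid', 'Ask', 'Bid'],
    'Start Time': pd.to_datetime(['2020-10-01 09:00', '2020-10-01 09:30',
                                  '2020-10-02 09:00']),
    'End Time': pd.to_datetime(['2020-10-01 10:00', '2020-10-01 11:00',
                                '2020-10-02 10:00']),
    'Last Action': ['Dealt', 'Cancelled', 'Cancelled'],
})
pairs = toy.groupby('ID').apply(get_pairs).droplevel(1)
```

The result has one row per flagged pair, with the two matched orders side by side in the `' 1'` / `' 2'` suffixed columns.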