How to efficiently find overlapping intervals?

I have the following toy example dataframe, df:

f_low    f_high
0.476201  0.481915
0.479161  0.484977
0.485997  0.491911
0.503259  0.508679
0.504687  0.510075
0.504687  0.670075
0.666093  0.670438
0.765602  0.770028
0.766884  0.771307
0.775986  0.780398
0.794590  0.798965

To find the overlapping subsets of this, I use the following code:

df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()

I would like to get overlapping dataframes like this:

# iteration 1: over row.f_low=0.476201  row.f_high=0.481915 
f_low    f_high
0.476201  0.481915
0.479161  0.484977
# iteration 2: over row.f_low=0.503259  row.f_high=0.508679 
f_low    f_high
0.503259  0.508679 
0.504687  0.510075
0.504687  0.670075
# iteration 3: over row.f_low=0.504687  row.f_high=0.670075 
f_low    f_high
0.666093  0.670438

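etc.

This works fine, but because the dataframe is very large and contains many overlaps, it takes a long time to process. In addition, when I use the pandas Interval and overlaps methods, the interval I am testing for overlaps does not pick up itself.

What this represents is a series of overlapping confidence intervals, with each row being iterated over.

Is there a way to extract the intervals that overlap a given interval more efficiently, other than iterating over all the tuples?

Here is a chunk of the actual dataframe, unsorted: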

f_low   f_high
0.504687  0.670075
0.476201  0.481915
0.765602  0.770028
0.479161  0.484977
0.766884  0.771307
0.485997  0.491911
0.666093  0.670438
0.503259  0.508679
0.775986  0.780398
0.504687  0.510075
0.794590  0.798965

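Consecutive overlaps

Treat each "f_low" value as an entry point and assign it a value of 1. Treat each "f_high" value as an exit point and assign it a value of -1. If we process all the values in increasing order and accumulate the assigned values, we are inside at least one interval whenever the accumulated value is greater than zero. When the accumulated value reaches zero, we know we have exited every overlapping interval.

Note: this groups all consecutively overlapping intervals together. If an interval does not overlap the first one BUT does overlap the last one in the chain, it still counts as overlapping.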

Below this solution I'll provide a similar one for the other option.

Contrived example

#  1     3                     (Interval from 1 to 3)
#     2        5               (Interval from 2 to 5)
#                    7     9   (Interval from 7 to 9)
#  1  1 -1    -1     1    -1   (Entry/Exit values)
#  1  2  1     0     1     0   (Accumulated values)
#              ⇑           ⇑
# zero indicates leaving all overlaps

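This shows that once we enter the interval from 1 to 3, we do not leave all overlapping intervals until we reach 5, the right side of the interval from 2 to 5, as indicated by the accumulated value reaching zero.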
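To make the accumulation concrete, here is a minimal sketch that reproduces the entry/exit row and the accumulated row of the diagram for the three contrived intervals. The variable names (`points`, `steps`, `running`) are my own and are not part of the solution that follows.

```python
import numpy as np

# Contrived intervals from the diagram: (1, 3), (2, 5), (7, 9)
lows = np.array([1, 2, 7])
highs = np.array([3, 5, 9])

# Every endpoint becomes an event: +1 when entering, -1 when exiting
points = np.concatenate([lows, highs])
steps = np.concatenate([np.ones_like(lows), -np.ones_like(highs)])

# Process the events in increasing order of position and accumulate
order = np.argsort(points, kind="stable")
running = steps[order].cumsum()

print(points[order])  # [1 2 3 5 7 9]
print(running)        # [1 2 1 0 1 0]
```

The zeros at positions 5 and 9 are the points where all overlaps have been left, so the first two intervals form one group and the third stands alone.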

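I'll use a generator to return lists of indices of the original dataframe that have overlapping intervals.

When all is said and done, this should be N * Log(N) because of the sorting.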

import numpy as np

def gen_overlaps(df):
    df = df.sort_values('f_low')

    # get the sorter for all lows and highs
    a = df.to_numpy().ravel().argsort()

    # get the un-sorter (inverse permutation)
    b = np.empty_like(a)
    b[a] = np.arange(len(a))

    # get ones and negative ones
    # to indicate entering into
    # and exiting an interval
    c = np.ones(df.shape, int) * [1, -1]

    # if we sort by all values and
    # accumulate when we enter and exit
    # the accumulated value should be
    # zero when there are no overlaps
    d = c.ravel()[a].cumsum()[b].reshape(df.shape)
    #             ⇑           ⇑
    # sort by value order     unsort to get back to original order

    indices = []
    for i, indicator in zip(df.index, d[:, 1] == 0):
        indices.append(i)
        if indicator:
            yield indices
            indices = []
    if indices:
        yield indices

Then I'll use pd.concat to organize them and show what I mean. k is the kth group. Some groups have only one interval.

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})
         f_low    f_high
0 0   0.476201  0.481915
  1   0.479161  0.484977
1 2   0.485997  0.491911
2 3   0.503259  0.508679
  4   0.504687  0.510075
  5   0.504687  0.670075
  6   0.666093  0.670438
3 7   0.765602  0.770028
  8   0.766884  0.771307
4 9   0.775986  0.780398
5 10  0.794590  0.798965

If we wanted only those that overlap...

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})
        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
2 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
  6  0.666093  0.670438
3 7  0.765602  0.770028
  8  0.766884  0.771307
Only overlaps with the next interval in the queue

This is a simpler solution and matches the OP's desired output.

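def gen_overlaps(df):
    df = df.sort_values('f_low')

    indices = []
    cursor = None
    for i, low, high in df.itertuples():
        if not indices:
            # first row: it anchors the first group
            cursor = high
        if low <= cursor:
            indices.append(i)
        else:
            # only emit groups that actually contain an overlap
            if len(indices) > 1:
                yield indices
            # the current row anchors the next group
            # (resetting to [] here would silently drop it)
            indices = [i]
            cursor = high
    if len(indices) > 1:
        yield indices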

pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})
        f_low    f_high
0 0  0.476201  0.481915
  1  0.479161  0.484977
1 3  0.503259  0.508679
  4  0.504687  0.510075
  5  0.504687  0.670075
2 7  0.765602  0.770028
  8  0.766884  0.771307

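If I understand correctly, you want to split your current df into data frames where the first interval is set by the first row, the second interval is set by the first row that does not intersect it, and so on. The method below does that and should be fairly efficient as long as the number of groups is not too large:

df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
    low = df.f_low[idx]
    high = df.f_high[idx]
    sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
    dfs.append(sub_df)
    idx = sub_df.index.max() + 1
    if idx > df.index.max():
        break

Output: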

[      f_low    f_high
0  0.476201  0.481915
1  0.479161  0.484977,
f_low    f_high
2  0.485997  0.491911,
f_low    f_high
3  0.503259  0.508679
4  0.504687  0.510075
5  0.504687  0.670075,
f_low    f_high
6  0.666093  0.670438,
f_low    f_high
7  0.765602  0.770028
8  0.766884  0.771307,
f_low    f_high
9  0.775986  0.780398,
f_low    f_high
10  0.79459  0.798965]
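Each pass of the loop above scans the whole frame with a boolean mask. Since f_low is sorted, the end of each group can also be located with a binary search. The following is only a sketch of that variation using np.searchsorted; the function name `split_by_anchor` is made up for illustration, and it assumes f_low <= f_high in every row.

```python
import numpy as np
import pandas as pd

def split_by_anchor(df):
    """Split df into consecutive groups, each anchored by its first row."""
    df = df.sort_values("f_low").reset_index(drop=True)
    lows = df["f_low"].to_numpy()
    highs = df["f_high"].to_numpy()

    groups = []
    start = 0
    while start < len(df):
        # Position of the first row whose f_low exceeds the anchor's f_high
        stop = int(np.searchsorted(lows, highs[start], side="right"))
        # Guard against malformed rows (f_low > f_high) stalling the loop
        stop = max(stop, start + 1)
        groups.append(df.iloc[start:stop])
        start = stop
    return groups

df = pd.DataFrame({
    "f_low":  [0.504687, 0.476201, 0.765602, 0.479161, 0.766884, 0.485997,
               0.666093, 0.503259, 0.775986, 0.504687, 0.794590],
    "f_high": [0.670075, 0.481915, 0.770028, 0.484977, 0.771307, 0.491911,
               0.670438, 0.508679, 0.780398, 0.510075, 0.798965],
})

groups = split_by_anchor(df)
print([len(g) for g in groups])  # [2, 1, 3, 1, 2, 1, 1]
```

The group sizes match the seven frames in the output above; the cost per group drops from a full scan to a logarithmic lookup.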

Would this work?

import itertools

intervals = df.apply(lambda row: pd.Interval(row['f_low'], row['f_high']), axis=1)
overlaps = [
    (i, j, x, y, x.overlaps(y))
    for ((i, x), (j, y))
    in itertools.product(enumerate(intervals), repeat=2)
]
>>> overlaps[:3]
[(0,
0,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47620100000000004, 0.481915, closed='right'),
True),
(0,
1,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47916099999999995, 0.48497700000000005, closed='right'),
True),
(0,
2,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.485997, 0.491911, closed='right'),
False)]

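From here you can get the numeric indices in the original DataFrame. I'm not sure how well it performs, but it should be better than what you have now.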
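For instance, pulling out the positional index pairs could look roughly like the sketch below. It swaps itertools.product for itertools.combinations so that self-pairs and mirrored pairs are skipped, which is my own choice rather than part of the answer, and it uses only the first three rows of the toy data to keep the output short.

```python
import itertools
import pandas as pd

df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997],
    "f_high": [0.481915, 0.484977, 0.491911],
})

# One pd.Interval per row (closed='right' is the pandas default)
intervals = [pd.Interval(lo, hi) for lo, hi in zip(df["f_low"], df["f_high"])]

# Positional pairs (i, j) with i < j whose intervals overlap
pairs = [
    (i, j)
    for (i, x), (j, y) in itertools.combinations(enumerate(intervals), 2)
    if x.overlaps(y)
]

print(pairs)  # [(0, 1)]

# Use the positions to pull the matching rows back out of df
i, j = pairs[0]
print(df.iloc[[i, j]])
```

This still compares every pair in Python, so it stays quadratic in the number of rows.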

Using numpy's array broadcasting:

import numpy as np

l1 = df['f_low'].to_numpy()
h1 = df['f_high'].to_numpy()
l2 = l1[:, None]
h2 = h1[:, None]
# Check for overlap
# mask is an n * n matrix indicating if interval i overlaps with interval j
mask = (l1 < h2) & (h1 > l2)
# If interval i overlaps interval j then j also overlaps i. We only want to get
# one of the two pairs. Hence the `triu` (triangle, upper)
# Every interval also overlaps itself and we don't want that either. Hence the k=1
overlaps = np.triu(mask, k=1).nonzero()

The result of overlaps needs some explanation:

(array([0, 3, 3, 4, 5, 7]),
array([1, 4, 5, 5, 6, 8]))
# Row 0 overlaps with row 1
# Row 3 overlaps with row 4
# Row 3 overlaps with row 5
# ....
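To turn that pair of arrays back into rows, something like the following sketch may help. It assumes the frame has been sorted by f_low and given a fresh 0..n-1 index, which is what the row numbers in the result above appear to correspond to; `i_idx`, `j_idx` and `pairs` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data, sorted by f_low with a default 0..n-1 index (assumption)
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})

low = df["f_low"].to_numpy()
high = df["f_high"].to_numpy()

# mask[i, j] is True when interval i and interval j overlap
mask = (low < high[:, None]) & (high > low[:, None])

# Upper triangle without the diagonal: each overlapping pair once
i_idx, j_idx = np.triu(mask, k=1).nonzero()
pairs = pd.DataFrame({"i": i_idx, "j": j_idx})

# All rows overlapping row 3; the diagonal is True, so row 3 is included
rows_for_3 = df[mask[3]]

print(pairs.to_numpy().tolist())
# [[0, 1], [3, 4], [3, 5], [4, 5], [5, 6], [7, 8]]
```

Indexing with a full row of `mask` also addresses the point from the question about an interval not picking up itself. Keep in mind that the mask needs n * n booleans, so this is only practical while that matrix fits in memory.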

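I'm not sure what kind of overlap you need, but I think this approach could work for it:

Make sure your mask is sufficient.
Create a dictionary keyed by f_low and f_high.
Filter the original dataframe.
As you said, the real use case is a large dataset, so query should perform better than .loc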

import pandas as pd
df = pd.DataFrame(
    [
        [0.504687, 0.670075],
        [0.476201, 0.481915],
        [0.765602, 0.770028],
        [0.479161, 0.484977],
        [0.766884, 0.771307],
        [0.485997, 0.491911],
        [0.666093, 0.670438],
        [0.503259, 0.508679],
        [0.775986, 0.780398],
        [0.504687, 0.510075],
        [0.794590, 0.798965]
    ],
    columns=["f_low", "f_high"]
)
overlap = {
    (row.f_low, row.f_high): df.query("(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)")
    for row in df.itertuples()
}
