I have the following toy example dataframe, df:
f_low f_high
0.476201 0.481915
0.479161 0.484977
0.485997 0.491911
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
0.666093 0.670438
0.765602 0.770028
0.766884 0.771307
0.775986 0.780398
0.794590 0.798965
To find the overlapping subsets of this, I use the following code:
df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()
I would like to get overlapping dataframes like this:
# iteration 1: over row.f_low=0.476201 row.f_high=0.481915
f_low f_high
0.476201 0.481915
0.479161 0.484977
# iteration 2: over row.f_low=0.503259 row.f_high=0.508679
f_low f_high
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
# iteration 3: over row.f_low=0.504687 row.f_high=0.670075
f_low f_high
0.666093 0.670438
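etc.
This works, but because the dataframe is quite large and there are many overlaps, it takes a long time to process. Also, when I use the pandas Interval and overlaps methods, the interval I am testing for overlaps does not grab itself.
What this represents is a series of overlapping confidence intervals, with each row being iterated over.
Apart from iterating over all the tuples, is there a more efficient way to extract the intervals that overlap a given interval?
Here is a chunk of the actual dataframe, unsorted: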
f_low f_high
0.504687 0.670075
0.476201 0.481915
0.765602 0.770028
0.479161 0.484977
0.766884 0.771307
0.485997 0.491911
0.666093 0.670438
0.503259 0.508679
0.775986 0.780398
0.504687 0.510075
0.794590 0.798965
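Contiguous overlaps
Treat the "f_low" values as entry points and assign each a value of 1. Treat the "f_high" values as exit points and assign each a value of -1. If we process all the values in increasing order and accumulate the assigned values, then we are inside an overlapping run whenever the accumulated value is greater than zero. When the accumulated value reaches zero, we know we have exited every overlapping interval.
Note: this groups all contiguously overlapping intervals. If an interval does not overlap the first interval in the chain BUT does overlap the last one, it still counts as overlapping.
Below this solution I'll provide a similar solution for the other option.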
Example
#    1     3              (Interval from 1 to 3)
#       2      5          (Interval from 2 to 5)
#                  7   9  (Interval from 7 to 9)
#    1  1 -1  -1   1  -1  (Entry/Exit values)
#    1  2  1   0   1   0  (Accumulated values)
#              ⇑       ⇑
#      zero indicates leaving all overlaps
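This shows that once we enter the interval from 1 to 3, we do not leave all overlapping intervals until we reach 5, the right side of the interval from 2 to 5, which is indicated by the accumulated value reaching zero.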
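The accumulation in the example can be reproduced in a few lines. This is only a sketch on the three toy intervals from the diagram; the names `values`, `tags`, and `running` are mine and are not part of the solution that follows.

```python
import numpy as np

# Toy intervals from the diagram: (1, 3), (2, 5), (7, 9).
intervals = np.array([[1, 3], [2, 5], [7, 9]])

# Flatten to [low, high, low, high, ...]; tag each low with +1 and each high with -1.
values = intervals.ravel()
tags = np.tile([1, -1], len(intervals))

# Visit the endpoints in increasing order and accumulate the tags.
order = np.argsort(values, kind="stable")
running = tags[order].cumsum()

sorted_values = values[order].tolist()
running_total = running.tolist()
print(sorted_values)   # [1, 2, 3, 5, 7, 9]
print(running_total)   # [1, 2, 1, 0, 1, 0]
```

The zeros sit at the endpoints 5 and 9, which are exactly the places where one run of overlapping intervals ends.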
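I'll use a generator to return lists of indices from the original dataframe, one list per run of overlapping intervals.
When all is said and done, this should be O(N * log(N)), which is the cost of the sort.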
import numpy as np

def gen_overlaps(df):
    df = df.sort_values('f_low')
    # get the sorter for all lows and highs together
    a = df.to_numpy().ravel().argsort()
    # build the inverse permutation (the un-sorter)
    b = np.empty_like(a)
    b[a] = np.arange(len(a))
    # get ones and negative ones
    # to indicate entering into
    # and exiting an interval
    c = np.ones(df.shape, int) * [1, -1]
    # if we sort by all values and
    # accumulate when we enter and exit
    # the accumulated value should be
    # zero when there are no overlaps
    d = c.ravel()[a].cumsum()[b].reshape(df.shape)
    #             ⇑            ⇑
    # sort by value order      unsort to get back to original order
    indices = []
    # d[:, 1] holds the accumulated value at each row's f_high;
    # a zero there means this row closes the current group
    for i, indicator in zip(df.index, d[:, 1] == 0):
        indices.append(i)
        if indicator:
            yield indices
            indices = []
    if indices:
        yield indices
Then I'll use pd.concat to organize them and show what I mean. k is the kth group. Some groups contain only one interval.
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
1 2 0.485997 0.491911
2 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
6 0.666093 0.670438
3 7 0.765602 0.770028
8 0.766884 0.771307
4 9 0.775986 0.780398
5 10 0.794590 0.798965
If we only want the ones that overlap…
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
    if len(i) > 1
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
2 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
6 0.666093 0.670438
3 7 0.765602 0.770028
8 0.766884 0.771307
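The same grouping can also be sketched without an explicit Python loop. This is a different technique from the generator above (a running maximum of f_high rather than entry/exit accumulation), and the names `prev_max_high` and `group_id` are mine. It assumes a new group starts when a row's f_low is strictly greater than every f_high seen so far, so intervals that merely touch end up in the same group.

```python
import pandas as pd

# The toy data from the question, already sorted by f_low.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})

s = df.sort_values("f_low", kind="stable")

# Largest f_high among all earlier rows (NaN for the first row).
prev_max_high = s["f_high"].cummax().shift()

# A row opens a new group when it starts after everything before it has ended.
group_id = (s["f_low"] > prev_max_high).cumsum().rename("group")

groups = [g.index.tolist() for _, g in s.groupby(group_id)]
print(groups)
# [[0, 1], [2], [3, 4, 5, 6], [7, 8], [9], [10]]
```

On the toy data this gives the same six groups as the first pd.concat output above.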
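Only overlaps the next interval in the queue
This is a simpler solution that matches the OP's desired output.
def gen_overlaps(df):
    df = df.sort_values('f_low')
    indices = []
    cursor = None
    for i, low, high in df.itertuples():
        # the first row of a group sets the upper bound for that group
        if not indices:
            cursor = high
        if low <= cursor:
            indices.append(i)
        else:
            # this row starts past the cursor: emit the finished group;
            # note that the row itself is not carried into the next group
            yield indices
            indices = []
            cursor = high
    if len(indices) > 1:
        yield indices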
pd.concat({
    k: df.loc[i] for k, i in
    enumerate(gen_overlaps(df))
})
f_low f_high
0 0 0.476201 0.481915
1 0.479161 0.484977
1 3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075
2 7 0.765602 0.770028
8 0.766884 0.771307
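If I understand correctly, you want to split your current df into dataframes where the first interval is set by the first row, the second interval is set by the first row that does not intersect it, and so on. The approach below does that, and it should be fairly efficient as long as the number of groups is not too large:
df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
    low = df.f_low[idx]
    high = df.f_high[idx]
    # rows whose f_low falls inside the leading interval [low, high]
    sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
    dfs.append(sub_df)
    # continue from the first row after this group
    idx = sub_df.index.max() + 1
    if idx > df.index.max():
        break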
Output:
[ f_low f_high
0 0.476201 0.481915
1 0.479161 0.484977,
f_low f_high
2 0.485997 0.491911,
f_low f_high
3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075,
f_low f_high
6 0.666093 0.670438,
f_low f_high
7 0.765602 0.770028
8 0.766884 0.771307,
f_low f_high
9 0.775986 0.780398,
f_low f_high
10 0.79459 0.798965]
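Because the frame is sorted by f_low, the end of each group can also be located with a binary search instead of a boolean mask over the whole frame. The following is only a sketch of that variation, not the code above; it assumes f_low <= f_high in every row, and the names `lows`, `highs`, `start`, and `stop` are mine.

```python
import numpy as np
import pandas as pd

# The toy data from the question, already sorted by f_low.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})
df = df.sort_values("f_low", kind="stable").reset_index(drop=True)

lows = df["f_low"].to_numpy()
highs = df["f_high"].to_numpy()

dfs = []
start = 0
while start < len(df):
    # First position whose f_low is strictly greater than the leading row's f_high.
    stop = int(np.searchsorted(lows, highs[start], side="right"))
    dfs.append(df.iloc[start:stop])
    start = stop

group_indices = [g.index.tolist() for g in dfs]
print(group_indices)
# [[0, 1], [2], [3, 4, 5], [6], [7, 8], [9], [10]]
```

Each step is then a binary search plus a slice, so the number of groups matters less than it does with a full mask per group.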
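Would this work?
import itertools

intervals = df.apply(lambda row: pd.Interval(row['f_low'], row['f_high']), axis=1)
# every ordered pair (i, j) of rows, with a flag saying whether the two intervals overlap
overlaps = [
    (i, j, x, y, x.overlaps(y))
    for ((i, x), (j, y))
    in itertools.product(enumerate(intervals), repeat=2)
]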
>>> overlaps[:3]
[(0,
0,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47620100000000004, 0.481915, closed='right'),
True),
(0,
1,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.47916099999999995, 0.48497700000000005, closed='right'),
True),
(0,
2,
Interval(0.47620100000000004, 0.481915, closed='right'),
Interval(0.485997, 0.491911, closed='right'),
False)]
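From here you can get the numeric indices in the original DataFrame. I'm not sure how well it performs, but it should be better than what you have now.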
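As a sketch of how the indices could be pulled out, the snippet below keeps only the pairs with i < j whose intervals overlap. It uses itertools.combinations instead of itertools.product so that self-pairs and mirrored pairs never appear; that swap and the name `pairs` are my own choices, and i and j are positional row numbers.

```python
import itertools
import pandas as pd

# The toy data from the question, already sorted by f_low.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})

# pd.Interval defaults to closed='right', as in the output shown above.
intervals = [pd.Interval(lo, hi) for lo, hi in zip(df["f_low"], df["f_high"])]

# Unordered pairs of distinct rows whose intervals overlap.
pairs = [
    (i, j)
    for (i, x), (j, y) in itertools.combinations(enumerate(intervals), 2)
    if x.overlaps(y)
]
print(pairs)
# [(0, 1), (3, 4), (3, 5), (4, 5), (5, 6), (7, 8)]
```

This is still quadratic in the number of rows, so it mainly helps with readability rather than speed.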
Using numpy's array broadcasting:
l1 = df['f_low'].to_numpy()
h1 = df['f_high'].to_numpy()
l2 = l1[:, None]
h2 = h1[:, None]
# Check for overlap
# mask is an n * n matrix indicating if interval i overlaps with interval j
mask = (l1 < h2) & (h1 > l2)
# If interval i overlaps interval j then j also overlaps i. We only want to get
# one of the two pairs. Hence the `triu` (triangle, upper)
# Every interval also overlaps itself and we don't want that either. Hence the k=1
overlaps = np.triu(mask, k=1).nonzero()
The result of overlaps needs some explanation:
(array([0, 3, 3, 4, 5, 7]),
array([1, 4, 5, 5, 6, 8]))
# Row 0 overlaps with row 1
# Row 3 overlaps with row 4
# Row 3 overlaps with row 5
# ....
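If what you want is the per-row subset from the question, where each subset includes the row itself, the full mask can be used before triu is applied. This is a sketch under that assumption; the name `per_row` is mine, and the frame is the sorted toy data.

```python
import numpy as np
import pandas as pd

# The toy data from the question, already sorted by f_low.
df = pd.DataFrame({
    "f_low":  [0.476201, 0.479161, 0.485997, 0.503259, 0.504687, 0.504687,
               0.666093, 0.765602, 0.766884, 0.775986, 0.794590],
    "f_high": [0.481915, 0.484977, 0.491911, 0.508679, 0.510075, 0.670075,
               0.670438, 0.770028, 0.771307, 0.780398, 0.798965],
})

l1 = df["f_low"].to_numpy()
h1 = df["f_high"].to_numpy()

# mask[i, j] is True when interval i overlaps interval j; the diagonal is True.
mask = (l1 < h1[:, None]) & (h1 > l1[:, None])

# For each row, the index labels of every interval it overlaps, itself included.
per_row = {i: df.index[mask[i]].tolist() for i in range(len(df))}
print(per_row[0])   # [0, 1]
print(per_row[3])   # [3, 4, 5]
print(per_row[5])   # [3, 4, 5, 6]

# The upper triangle reproduces the pair arrays shown above.
rows, cols = np.triu(mask, k=1).nonzero()
```

Keep in mind that mask is an n * n boolean array, so memory grows quadratically with the number of rows.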
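I'm not sure what kind of overlap you need, but I think this approach could work for it:
- Make sure your mask is adequate.
- Create a dictionary whose keys are the (f_low, f_high) pairs.
- Filter the original dataframe.
- As you said, the real use case is a large dataset, so query should perform better than .loc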
import pandas as pd
df = pd.DataFrame(
    [
        [0.504687, 0.670075],
        [0.476201, 0.481915],
        [0.765602, 0.770028],
        [0.479161, 0.484977],
        [0.766884, 0.771307],
        [0.485997, 0.491911],
        [0.666093, 0.670438],
        [0.503259, 0.508679],
        [0.775986, 0.780398],
        [0.504687, 0.510075],
        [0.794590, 0.798965]
    ],
    columns=["f_low", "f_high"]
)
overlap = {
    (row.f_low, row.f_high): df.query("(@row.f_low <= f_low <= @row.f_high) or (@row.f_low <= f_high <= @row.f_high)")
    for row in df.itertuples()
}
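On the first bullet, one case worth checking is an interval that completely contains the one being tested: neither of its endpoints falls inside the tested range, so an endpoint-based mask skips it. The sketch below uses three hypothetical intervals (not from the question's data) to compare the mask above with a symmetric condition; `lo`, `hi`, `endpoint_test`, and `symmetric_test` are names I made up.

```python
import pandas as pd

# Hypothetical intervals: the first one fully contains the second.
df = pd.DataFrame(
    [[0.0, 10.0], [2.0, 3.0], [20.0, 30.0]],
    columns=["f_low", "f_high"],
)

# The interval being tested is the second row.
lo, hi = 2.0, 3.0

# Endpoint-based mask, same shape as the query above.
endpoint_test = df.query("(@lo <= f_low <= @hi) or (@lo <= f_high <= @hi)")

# Symmetric mask: two intervals overlap when each starts before the other ends.
symmetric_test = df.query("f_low <= @hi and f_high >= @lo")

print(endpoint_test.index.tolist())    # [1]
print(symmetric_test.index.tolist())   # [0, 1]
```

Whether the containing interval should count depends on what kind of overlap you need, which is why the mask deserves a second look.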