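Extracting the N closest pairs from a numpy distance array

I have a large, symmetric 2D distance array. I want to get the N closest pairs of observations.

The array is stored as a numpy condensed array and has on the order of 100 million observations.

Here is an example that gets the 100 closest distances on a smaller array (~500k observations), but it is a lot slower than I would like.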

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: http://stackoverflow.com/questions/5323818/condensed-matrix-function-to-find-pairs
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1
r = []
# number of observations; cast to int because np.triu_indices rejects a float
n = int(np.ceil(np.sqrt(2 * len(dists))))
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

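It seems to me there must be a quicker way to do this with standard numpy or scipy functions, but I'm stumped.

Note: if lots of pairs are equidistant, that's fine; I don't care about their ordering in that case.

You don't need to compute ti on every call to condensed_to_square_index. Here is a basic modification that computes it only once: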

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c]+ 1, ti[1][c]+ 1
r = []
# number of observations; cast to int because np.triu_indices rejects a float
n = int(np.ceil(np.sqrt(2 * len(dists))))
ti = np.triu_indices(n, 1)
for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r:

r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))

or

r = np.vstack(ti)[:, closest] + 1

If you are on numpy 1.8 or later, you can speed up locating the smallest values very noticeably by using np.partition:

def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])
def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))
dists = np.random.rand(1000*999//2) # a pdist array
In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True
In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop
In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop

Once you have the smallest indices, you don't need a loop to extract the pairs; do it in a single shot:

closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
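
For an array with around 100 million entries, np.triu_indices itself allocates two index arrays as long as dists. A minimal sketch of a closed-form alternative follows, assuming the row-major upper-triangle ordering that pdist uses. The helper names n_obs_from_condensed and condensed_to_pair are hypothetical, and the returned indices are 0-based (add 1 to match the code above).

```python
import math
import numpy as np

def n_obs_from_condensed(m):
    # Hypothetical helper: invert m = n * (n - 1) / 2 exactly with integer math.
    return (1 + math.isqrt(1 + 8 * m)) // 2

def condensed_to_pair(k, n):
    # Hypothetical helper: map condensed indices k to 0-based (i, j) with i < j,
    # without materializing np.triu_indices(n, 1).
    k = np.asarray(k, dtype=np.int64)
    i = n - 2 - np.floor(
        np.sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7) / 2.0 - 0.5
    ).astype(np.int64)
    j = k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2
    return i, j

dists = np.random.rand(1000 * 999 // 2)      # stand-in for a pdist array
n = n_obs_from_condensed(len(dists))
closest = np.argpartition(dists, 100)[:100]  # unordered 100 smallest
i, j = condensed_to_pair(closest, n)
pairs = np.column_stack((i, j))
```

Only the N selected indices are converted, so the extra memory is proportional to N rather than to the length of dists.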

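The best solution probably won't generate all of the distances.

Proposal:

Make a heap with a maximum size of 100 (if it grows bigger, trim it)
Use the closest pair algorithm to find the closest pair
Add that pair to the heap (priority queue)
Choose one point of that pair. Add its 99 closest neighbors to the heap
Remove the chosen point from the list
Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you have run the closest pair algorithm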
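
The idea of not generating every distance can also be sketched with a k-d tree. This is a swapped-in technique, not the heap plus closest-pair procedure listed above: it relies on the observation that any pair among the N closest overall must appear in the N-nearest-neighbor list of one of its endpoints. The function name closest_pairs_kdtree is hypothetical, it needs the original points rather than the condensed array, and indices are 0-based.

```python
import numpy as np
from scipy.spatial import cKDTree

def closest_pairs_kdtree(points, n_pairs, p=1):
    # Hypothetical helper: return (pairs, distances) for the n_pairs closest
    # pairs of rows in points, using the Minkowski p-norm (p=1 is cityblock).
    points = np.asarray(points, dtype=float)
    n_obs = len(points)
    if n_obs < 2 or n_pairs < 1:
        return np.empty((0, 2), dtype=np.int64), np.empty(0)
    k = min(n_pairs + 1, n_obs)          # +1 because each point finds itself
    dist, idx = cKDTree(points).query(points, k=k, p=p)
    rows = np.repeat(np.arange(n_obs), k)
    cols, d = idx.ravel(), dist.ravel()
    keep = rows != cols                  # drop self-matches
    lo = np.minimum(rows[keep], cols[keep])
    hi = np.maximum(rows[keep], cols[keep])
    d = d[keep]
    # Each pair can be reported from both endpoints; keep one copy.
    pairs, first = np.unique(np.column_stack((lo, hi)), axis=0,
                             return_index=True)
    d = d[first]
    order = np.argsort(d, kind="stable")[:n_pairs]
    return pairs[order], d[order]

rng = np.random.default_rng(0)
pts = rng.random((300, 2))
pairs, d = closest_pairs_kdtree(pts, 50)
```

The candidate set has at most n_obs * (n_pairs + 1) entries, which is far smaller than the full condensed array when the number of observations is large relative to N.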

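You can use a pandas DataFrame. First, wrap the similarity matrix (for example, the output of sklearn's pairwise_distances()) in a DataFrame, adding the column and index names from the source data. Then select any column by name (the one you are interested in), call pandas.DataFrame.sort_values() on it, and take the top 5 or top 10. That's it.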
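
A minimal sketch of that approach, assuming a small random dataset and made-up labels (obs0, obs1, ...). It builds the square matrix with scipy's squareform(pdist(...)) as a stand-in for pairwise_distances(). Note that this ranks the neighbors of a single observation; it does not return the N closest pairs across the whole array.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((6, 2))                       # assumed sample data
labels = [f"obs{i}" for i in range(len(X))]  # hypothetical observation names

# Square distance matrix with labelled rows and columns.
dist_df = pd.DataFrame(squareform(pdist(X, "cityblock")),
                       index=labels, columns=labels)

target = "obs0"
# Drop the zero self-distance, then sort and keep the three nearest.
nearest = dist_df[target].drop(target).sort_values().head(3)
print(nearest)
```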
