Python中有效的邻近距离矩阵



我需要一个内存& &;在Python中计算1到10维中大约50000个点之间的距离的省时方法。到目前为止,我尝试的方法都不是很好;到目前为止,我尝试了:

  • scipy.spatial.distance.pdist计算全距离矩阵
  • scipy.spatial.KDTree.sparse_distance_matrix计算稀疏距离矩阵直到阈值

令我惊讶的是,sparse_distance_matrix表现不佳。我使用的示例是从单位5维球中均匀选择5000个点,其中pdist在0.113秒内返回结果,sparse_distance_matrix在44.966秒内返回结果,当我让它使用阈值0.1作为最大距离截止时。

在这一点上,我只会坚持使用pdist,但对于50000点,它将使用2.5 x 10^9个条目的numpy数组,我担心它是否会过载运行时(?)内存。

有没有人知道更好的方法,或者看到我的实现中明显的错误?提前感谢!

以下是在Python3中重现输出所需的内容:

import numpy as np
import math
import time
from scipy.spatial.distance import pdist
from scipy.spatial import KDTree as kdtree
# Generate a uniform sample of size N on the unit dim-dimensional sphere (which lives in dim+1 dimensions)
def sphere(N, dim):
# Get a random sample of points from the (dim+1)-dim. Gaussian.
output = np.random.multivariate_normal(mean=np.zeros(dim+1), cov=np.identity(dim+1), size=N)
# Normalize output
output = output / np.linalg.norm(output, axis=1).reshape(-1,1)
return output
# Generate a uniform sample of size N on the unit dim-dimensional ball.
def ball(N, dim):
# Populate the points on the unit sphere that is the boundary.
sphere_points = sphere(N, dim-1)
# Randomize radii of the points on the sphere using power law to get a uniform distribution on the ball.
radii = np.power(np.random.random(N), 1/dim)
output = radii.reshape(-1, 1) * sphere_points
return output
N = 5000
dim = 5
r_cutoff = 0.1
# Generate a sample to test
sample = ball(N, dim)
# Construct a KD Tree for the sample
sample_kdt = kdtree(sample)
# pdist method for distance matrix
tic = time.monotonic()
pdist(sample)
toc = time.monotonic()
print(f"Time taken from pdist = {toc-tic}")
# KD Tree method for distance matrix
tic = time.monotonic()
sample_kdt.sparse_distance_matrix(sample_kdt, r_cutoff)
toc = time.monotonic()
print(f"Time taken from the KDTree method = {toc-tic}")
import numpy as np
from sklearn.neighbors import BallTree
tic = time.monotonic()
tree = BallTree(sample, leaf_size=10)       
d,i = tree.query(sample, k=1)
toc = time.monotonic()
print(f"Time taken from Sklearn BallTree = {toc-tic}")

这个程序在我的机器上执行Time taken from Sklearn BallTree = 0.30803330009803176pdist只做了一秒钟多一点。注意:我正在做一些繁重的计算,我的机器上有3/4个内核。

取最近的k=1


对于半径0.1

import numpy as np
from sklearn.neighbors import BallTree
tic = time.monotonic()
tree = BallTree(sample, leaf_size=10)       
i = tree.query_radius(sample, r=0.1)
toc = time.monotonic()
print(f"Time taken from Sklearn BallTree Radius = {toc-tic}")

与速度

Time taken from Sklearn BallTree Radius = 0.11115029989741743

最新更新