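I am looking for a smart way to optimize this looped Euclidean distance calculation. For each vector, it computes the mean distance to all other vectors.
Because my vector array is far too large for eucl_dist = euclidean_distances(eigen_vs_cleaned), I am running a loop row by row.
A typical eigen_vs_cleaned shape is currently at least (300000, 1000), and I will have to go larger (e.g. (2000000, 10000)).
Is there a smarter way to do this?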
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)
for z in range(eigen_vs_cleaned.shape[0]):
    if z % 10000 == 0:
        print(z)  # progress indicator
    # Distances from row z to every row, shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
    eucl_dist_meaned[z] = eucl_dist_temp.mean()
I am not a Python/NumPy expert, but this is my first step at optimizing it. At least it runs much better on my Mac Pro.
import multiprocessing
import os
import shutil
import tempfile

import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

# Create a temporary directory and define the array path
path = tempfile.mkdtemp()
out_path = os.path.join(path, 'out.mmap')
# Memory-mapped output array that every worker process can write into
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')
num_cores = multiprocessing.cpu_count()

def runparallel(row, out):
    if row % 10000 == 0:
        print(row)  # progress indicator
    # Distances from this row to every row, shape (1, n_rows)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
    out[row] = eucl_dist_temp.mean()

nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))
Then save the output:
eucl_dist_meaned = np.array(out,copy=True,dtype=float)
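A possible next step, offered only as a sketch: instead of one row per call, process a block of rows at a time, so that most of the work becomes a single large matrix product. The function name mean_distances_blocked and the block_size parameter below are hypothetical, not part of scikit-learn or the code above. The sketch uses plain NumPy and the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, and like the loops above it includes each row's zero distance to itself in the mean. It has not been benchmarked on arrays of the sizes mentioned.

```python
import numpy as np

def mean_distances_blocked(X, block_size=256):
    """Mean Euclidean distance from each row of X to all rows of X (self included).

    Hypothetical helper: works on blocks of rows so that only a
    (block_size, n_rows) distance matrix is held in memory at once.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq_norms = np.einsum("ij,ij->i", X, X)  # squared norm of every row
    out = np.empty(n, dtype=float)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, shape (stop - start, n)
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * (X[start:stop] @ X.T)
        np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
        np.sqrt(d2, out=d2)
        out[start:stop] = d2.mean(axis=1)
    return out
```

Assuming eigen_vs_cleaned fits in memory, it would be called as eucl_dist_meaned = mean_distances_blocked(eigen_vs_cleaned, block_size=256). Each block needs roughly block_size * n_rows * 8 bytes for the temporary distance matrix, so block_size should be chosen to fit the available RAM. The expanded formula can lose some precision for nearly identical rows compared with subtracting the vectors directly.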