How to speed up computing cosine similarity between sets of vectors



I have a set of vectors (~30k), each consisting of 300 elements generated by FastText, and each vector represents the meaning of an entity. I want to compute the similarity between all entities, so I iterate over the vectors in a nested loop, which is O(n^2) in complexity and impractical in terms of time.

Can you recommend a different way to compute this, or a way to parallelize it?

def calculate_similarity(v1, v2):
    """
Calculate the cosine similarity between two vectors
    """
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    return np.dot(v1, v2) / n1 / n2

similarities = {}
for ith_entity, ith_vector in vectors.items():
    for jth_entity, jth_vector in vectors.items():
        if ith_entity == jth_entity:
            continue
        if (ith_entity, jth_entity) in similarities or (jth_entity, ith_entity) in similarities:
            continue
        similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)

You can get rid of the slow nested loop by using scipy's spatial distance module.

Given vectors = {'k1': v1, 'k2': v2, ..., 'km': vm}, where each vi is a Python list of length n:
import numpy as np
from scipy.spatial import distance
# transform vectors to an m x n numpy array
data = np.array(list(vectors.values()))
# compute pairwise cosine distances
pws = distance.pdist(data, metric='cosine')

pws is a condensed distance matrix. It is one-dimensional and holds the distances in the following order:

pws = np.array([ (k1, k2), (k1, k3), (k1, k4), ..., (k1, km),
                           (k2, k3), (k2, k4), ..., (k2, km),
                                      ...,
                                                   (km-1, km) ])

Also note that distance.pdist computes the cosine distance, not the cosine similarity; the two are related by similarity = 1 - distance.
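
If you need the same {(entity_i, entity_j): similarity} mapping the question builds, one way (a minimal sketch, assuming vectors is the dict from the question and relying on Python 3.7+ preserving dict insertion order) is to pair the condensed result with itertools.combinations over the keys, which enumerates pairs in exactly the order shown above:

import numpy as np
from itertools import combinations
from scipy.spatial import distance

keys = list(vectors.keys())
data = np.array(list(vectors.values()))
pws = distance.pdist(data, metric='cosine')
# pdist returns cosine distance; cosine similarity = 1 - cosine distance
similarities = dict(zip(combinations(keys, 2), 1.0 - pws))

If you prefer a full m x m matrix instead of the condensed form, distance.squareform(pws) expands it.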

I vectorized the computation:

import numpy as np
from itertools import combinations

np.random.seed(1)
vector_data = np.random.randn(3, 3)
v1, v2, v3 = vector_data[0], vector_data[1], vector_data[2]

def similarities_vectorized(vector_data):
    norms = np.linalg.norm(vector_data, axis=1)
    # all index pairs (i, j) with i < j, as an array of shape (n_pairs, 2)
    combs = np.stack(list(combinations(range(vector_data.shape[0]), 2)))
    # row-wise dot products of each pair, divided by both norms
    similarities = ((vector_data[combs[:, 0]] * vector_data[combs[:, 1]]).sum(axis=1)
                    / norms[combs[:, 0]] / norms[combs[:, 1]])
    return combs, similarities

combs, similarities = similarities_vectorized(vector_data)
for comb, similarity in zip(combs, similarities):
    print(comb, similarity)

Output:

[0 1] -0.217095007411
[0 2] 0.894174618451
[1 2] -0.630555641519

Comparing the result with the code from the question:

def calculate_similarity(v1, v2):
    """
    Calculate the cosine similarity between two vectors
    """
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    return np.dot(v1, v2) / n1 / n2

def calculate_similarities(vectors):
    similarities = {}
    for ith_entity, ith_vector in vectors.items():
        for jth_entity, jth_vector in vectors.items():
            if ith_entity == jth_entity:
                continue
            if (ith_entity, jth_entity) in similarities or (jth_entity, ith_entity) in similarities:
                continue
            similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)
    return similarities

vectors = {'A': v1, 'B': v2, 'C': v3}
print(calculate_similarities(vectors))

Output:

{('A', 'B'): -0.21709500741113338, ('A', 'C'): 0.89417461845058566, ('B', 'C'): -0.63055564151883581}

When I ran this on a set of 300 vectors, the vectorized version was about 3.3 times faster.
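
For reference, a minimal sketch of how such a timing comparison can be reproduced (the 300 random test vectors are placeholders, and calculate_similarities / similarities_vectorized are assumed to be the functions defined above):

import time
import numpy as np

np.random.seed(1)
test_data = np.random.randn(300, 300)  # 300 vectors of length 300
test_vectors = {str(i): row for i, row in enumerate(test_data)}

t0 = time.perf_counter()
calculate_similarities(test_vectors)   # nested-loop version
t1 = time.perf_counter()
similarities_vectorized(test_data)     # vectorized version
t2 = time.perf_counter()
print(f"nested loop: {t1 - t0:.3f}s  vectorized: {t2 - t1:.3f}s")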

Update:

This version is 50 times faster than the original:

def similarities_vectorized2(vector_data):
    norms = np.linalg.norm(vector_data, axis=1)
    # build the index pairs as a structured array with two int fields
    # ('f0' and 'f1'), which is much faster than stacking a list of tuples
    combs = np.fromiter(combinations(range(vector_data.shape[0]), 2), dtype='i,i')
    similarities = ((vector_data[combs['f0']] * vector_data[combs['f1']]).sum(axis=1)
                    / norms[combs['f0']] / norms[combs['f1']])
    return combs, similarities
combs, similarities = similarities_vectorized2(vector_data)
for comb, similarity in zip(combs, similarities):
    print(comb, similarity)

Output:

(0, 1) -0.217095007411
(0, 2) 0.894174618451
(1, 2) -0.630555641519
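
Another fully vectorized option worth noting: once each row is normalized to unit length, the entire similarity matrix comes from a single matrix product. A minimal sketch on the same vector_data as above:

import numpy as np

np.random.seed(1)
vector_data = np.random.randn(3, 3)

# normalize each row to unit length; all pairwise cosine similarities
# are then given by one matrix product
normed = vector_data / np.linalg.norm(vector_data, axis=1, keepdims=True)
sim = normed @ normed.T
print(sim[0, 1], sim[0, 2], sim[1, 2])  # matches the pairwise values above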

You can use a ball tree. I used one on very large feature vectors of shape (16460, 4096). First, build the tree with the block below:

from scipy import spatial
from sklearn.neighbors import BallTree
tree = BallTree(features_tsvd, metric=spatial.distance.cosine)

Now to search the tree for a query, try something like this:

dists, ind = tree.query(query, k=10)
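
A minimal end-to-end sketch (the feature matrix here is a random placeholder for features_tsvd; note that a callable metric like spatial.distance.cosine is much slower than the built-in metrics):

import numpy as np
from scipy import spatial
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
features_tsvd = rng.standard_normal((500, 64))  # placeholder feature matrix

tree = BallTree(features_tsvd, metric=spatial.distance.cosine)

query = features_tsvd[:1]             # tree.query expects a 2-D array
dists, ind = tree.query(query, k=10)  # distances and row indices of the 10 nearest vectors
print(ind[0])                         # the query vector itself should be the first hit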
