How do I get the maximum similarity values between two lists with numpy?

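I have two lists. The idea is to compare each element of one list with every element of the other and extract the element with the highest similarity, much like a search engine.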

Variables used with NLU:

import numpy as np
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
match = ['sentence_to_1',
'sentence_to_2',
'sentence_to_3',
...]
match2 = ['sentence_from_1',
'sentence_from_2',
'sentence_from_3',
'sentence_from_1',
...]
pipe = nlu.load('xx.embed_sentence.bert_use_cmlm_multi_base_br')
df = pd.DataFrame({'one': match, 'two': match2})
predictions_1 = pipe.predict(df.one, output_level='document')
predictions_2 = pipe.predict(df.two, output_level='document')
e_col = 'sentence_embedding_bert_use_cmlm_multi_base_br'
predictions_1

output: 
document          sentence_embedding_bert_use_cmlm_multi_base_br
0 sentence_to_1     [0.018291207030415535, -0.05946089327335358, -...
1 sentence_to_2     [0.04855785518884659, 0.09505678713321686, 0.3...
2 sentence_to_3     [0.15838183462619781, -0.19057893753051758, -0...

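So far I iterate over one list and compare each of its elements against all elements of the other list, as shown here. I would also really appreciate a cheaper approach that avoids loops and list comprehensions: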

embed_mat = np.array([x for x in predictions_1[e_col]])
for i in match2:
    embedding = pipe.predict(i).iloc[0][e_col]
    m = np.array([embedding,]*len(df))
    sim_mat = cosine_similarity(m,embed_mat)
    print(sim_mat[0])

output:
[0.66812827 0.60055647 0.7160895  0.730334   0.76885804 0.54169453
0.61199156 0.6578508  0.68869315 0.71536224 0.64135093 0.68568607
0.7026179  0.64319338 0.60390899 0.64774842 0.62665297 0.61611091
0.62738365 0.60333599 0.61464704 0.68141089 0.75263237 0.77213446
0.75132462]
[0.72350056 0.65223669 0.67931278 0.62036637 0.67934842 0.62129368
0.69825526 0.55635858 0.62417926 0.57909757 0.58463102 0.75053411
0.62435311 0.66574652 0.6980762  0.72050293 0.64668413 0.62632569
0.63648157 0.59476883 0.66401519 0.68794243 0.64723412 0.68215344
0.66456176]
[0.84471557 0.75666135 0.75268174 0.71671225 0.74120815 0.78075131
0.75810087 0.67278428 0.72912575 0.70120557 0.70225784 0.78829443
0.70072031 0.76282867 0.78521151 0.76517436 0.7233746  0.71423372
0.69281594 0.71363751 0.73811129 0.7231086  0.73386457 0.76077197
0.75507266]
...

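Each printed array holds the similarities between one sentence from the second list (match2) and every sentence in the first list (match).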
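A side note on the loop above: since `m` is the same embedding repeated `len(df)` times, every row of `sim_mat` is identical and only `sim_mat[0]` is used. A minimal sketch, assuming toy random vectors in place of the real sentence embeddings, suggests that a single `(1, d)` query row gives the same values:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the real sentence embeddings (shapes are assumptions).
rng = np.random.default_rng(0)
embed_mat = rng.normal(size=(5, 8))   # 5 "to" sentences, 8-dimensional vectors
embedding = rng.normal(size=8)        # one "from" sentence

# Tiled version, as in the loop above: the query is repeated once per row.
m = np.array([embedding] * len(embed_mat))
tiled = cosine_similarity(m, embed_mat)        # shape (5, 5), identical rows

# Single-row version: one (1, 8) query against the (5, 8) matrix.
single = cosine_similarity(embedding.reshape(1, -1), embed_mat)  # shape (1, 5)

print(np.allclose(tiled[0], single[0]))
```

This does not remove the loop, but it avoids building the tiled matrix on every iteration.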

The goal is a final DataFrame like the one below, where for each element I search for, I get the element of the other list with the highest similarity.

element_from       element_to       similarity
0 sentence_from_1    sentence_to_5    0.95424...
1 sentence_from_3    sentence_to_10   0.93333...
2 sentence_from_11   sentence_to_12   0.55112...

An alternative solution that gives something similar:

# Cosine Similarity Calculation
# Note: this definition shadows the cosine_similarity imported from sklearn.
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))

# embed_mat is already a dense numpy array, so no .toarray() call is needed.
for i in range(embed_mat.shape[0]):
    for j in range(i + 1, embed_mat.shape[0]):
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(embed_mat[i], embed_mat[j]))

Output:
The cosine similarity between the documents sentence_from_1 and sentence_to_5 is   0.95424
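The pairwise formula above can also be applied to whole matrices at once: scale every row to unit length, then take one matrix product. The sketch below is only an illustration on made-up 2-D vectors; `pairwise_cosine` is a hypothetical helper name, and it assumes no row is all zeros.

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Divide each row by its Euclidean norm (assumes no zero-length rows).
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the dot product of unit row i of a and unit row j of b.
    return a_unit @ b_unit.T

x = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[2.0, 0.0], [0.0, 3.0]])
sims = pairwise_cosine(x, y)
print(sims)
```

Here `sims[i, j]` is the cosine similarity between `x[i]` and `y[j]`, so no explicit double loop is needed.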

I even managed to get the result I want by doing this:

embed_mat = np.array([x for x in predictions_1[e_col]])
to = []
fro = []
sim = []
for i in match2:
    fro.append(i)
    embedding = pipe.predict(i).iloc[0][e_col]
    m = np.array([embedding,]*len(df))
    sim_mat = cosine_similarity(m,embed_mat)
    sim.append(max(sim_mat[0]))
    to.append(predictions_1['document'].values[sim_mat[0].argmax()])
pd.DataFrame({'From': fro, 'To': to, 'Similarity': sim})

But I think there is a better way to solve this. Or rather, a more optimized one.
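For reference, one possible loop-free shape of this computation is sketched below. It is an assumption-laden illustration, not a tested NLU solution: the sentence names and 3-D vectors are made up, and in the real setting `emb_from` and `emb_to` would presumably be built from `predictions_2[e_col]` and `predictions_1[e_col]` the same way `embed_mat` is built above. sklearn's `cosine_similarity` accepts two matrices and returns all pairwise values, so `argmax` along each row picks the best match.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for match2 / match and their embeddings.
sentences_from = ['from_a', 'from_b', 'from_c']
sentences_to = ['to_a', 'to_b', 'to_c', 'to_d']
emb_to = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
emb_from = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.2, 0.8],
                     [0.5, 0.5, 0.1]])

# One call computes every from/to pair: shape (n_from, n_to).
sim = cosine_similarity(emb_from, emb_to)

# Column index of the most similar "to" sentence for each "from" sentence.
best = sim.argmax(axis=1)

result = pd.DataFrame({
    'From': sentences_from,
    'To': [sentences_to[j] for j in best],
    'Similarity': sim[np.arange(len(best)), best],
})
print(result)
```

With this layout the embeddings are computed once per list, rather than calling `pipe.predict` again for each sentence inside the loop.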
