I have two lists. The idea is to compare each element of one list against all elements of the other list and pick out the element with the highest similarity, like a search engine.
Variables used with NLU:
import numpy as np
import nlu
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
match = ['sentence_to_1',
'sentence_to_2',
'sentence_to_3',
...]
match2 = ['sentence_from_1',
'sentence_from_2',
'sentence_from_3',
'sentence_from_1',
...]
pipe = nlu.load('xx.embed_sentence.bert_use_cmlm_multi_base_br')
df = pd.DataFrame({'one': match, 'two': match2})
predictions_1 = pipe.predict(df.one, output_level='document')
predictions_2 = pipe.predict(df.two, output_level='document')
e_col = 'sentence_embedding_bert_use_cmlm_multi_base_br'
predictions_1
output:
document sentence_embedding_bert_use_cmlm_multi_base_br
0 sentence_to_1 [0.018291207030415535, -0.05946089327335358, -...
1 sentence_to_2 [0.04855785518884659, 0.09505678713321686, 0.3...
2 sentence_to_3 [0.15838183462619781, -0.19057893753051758, -0...
I have already iterated each element of one list against all elements of the other list this way. I would also really appreciate an idea that is less expensive, avoiding the loop and the list comprehension, for example:
embed_mat = np.array([x for x in predictions_1[e_col]])
for i in match2:
    embedding = pipe.predict(i).iloc[0][e_col]
    m = np.array([embedding,] * len(df))
    sim_mat = cosine_similarity(m, embed_mat)
    print(sim_mat[0])
output:
[0.66812827 0.60055647 0.7160895 0.730334 0.76885804 0.54169453
0.61199156 0.6578508 0.68869315 0.71536224 0.64135093 0.68568607
0.7026179 0.64319338 0.60390899 0.64774842 0.62665297 0.61611091
0.62738365 0.60333599 0.61464704 0.68141089 0.75263237 0.77213446
0.75132462]
[0.72350056 0.65223669 0.67931278 0.62036637 0.67934842 0.62129368
0.69825526 0.55635858 0.62417926 0.57909757 0.58463102 0.75053411
0.62435311 0.66574652 0.6980762 0.72050293 0.64668413 0.62632569
0.63648157 0.59476883 0.66401519 0.68794243 0.64723412 0.68215344
0.66456176]
[0.84471557 0.75666135 0.75268174 0.71671225 0.74120815 0.78075131
0.75810087 0.67278428 0.72912575 0.70120557 0.70225784 0.78829443
0.70072031 0.76282867 0.78521151 0.76517436 0.7233746 0.71423372
0.69281594 0.71363751 0.73811129 0.7231086 0.73386457 0.76077197
0.75507266]
...
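Note that sklearn's `cosine_similarity` already accepts two 2-D arrays and returns every pairwise similarity in one call, so tiling one embedding with `np.array([embedding,] * len(df))` should not be necessary. A minimal sketch with random stand-in embeddings (the names `emb_from` and `emb_to` are placeholders, not from the original code):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
emb_from = rng.normal(size=(4, 8))  # stand-in for the match2 embeddings
emb_to = rng.normal(size=(6, 8))    # stand-in for embed_mat

# One call yields the full (4, 6) matrix: row i holds the similarity of
# the i-th "from" sentence against every "to" sentence.
sim = cosine_similarity(emb_from, emb_to)
print(sim.shape)  # (4, 6)
```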
Each row of this array represents the similarity between one sentence of the second list and all sentences of the first list.
The idea is to end up with a final frame like this, where for each element I search from one list, I find the element of the second list with the highest similarity:
element_from element_to similarity
0 sentence_from_1 sentence_to_5 0.95424...
1 sentence_from_3 sentence_to_10 0.93333...
2 sentence_from_11 sentence_to_12 0.55112...
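Assuming the embeddings of both lists are available as 2-D arrays, a frame like the one above can be built with a single `argmax` per row instead of a Python loop. A sketch with random placeholder embeddings (the array names and values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

match = ['sentence_to_1', 'sentence_to_2', 'sentence_to_3']
match2 = ['sentence_from_1', 'sentence_from_2']
rng = np.random.default_rng(1)
emb_to = rng.normal(size=(len(match), 8))     # placeholder for the match embeddings
emb_from = rng.normal(size=(len(match2), 8))  # placeholder for the match2 embeddings

sim = cosine_similarity(emb_from, emb_to)
best = sim.argmax(axis=1)  # index of the most similar "to" sentence per row

result = pd.DataFrame({
    'element_from': match2,
    'element_to': [match[j] for j in best],
    'similarity': sim[np.arange(len(match2)), best],
})
```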
A similar alternative solution that was suggested:
# Cosine similarity for a single pair of vectors
# (note: this definition shadows the sklearn cosine_similarity import)
def cosine_similarity(vector1, vector2):
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)
    return np.dot(vector1, vector2) / (np.sqrt(np.sum(vector1**2)) * np.sqrt(np.sum(vector2**2)))

# embed_mat is already a dense ndarray, so no .toarray() is needed
for i in range(embed_mat.shape[0]):
    for j in range(i + 1, embed_mat.shape[0]):
        print("The cosine similarity between the documents ", i, "and", j, "is: ",
              cosine_similarity(embed_mat[i], embed_mat[j]))
Output:
The cosine similarity between the documents sentence_from_1 and sentence_to_5 is 0.95424
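The hand-rolled pairwise function above can also be vectorized: normalize every row once, then a single matrix product gives all pairs at once. A sketch (the function name `cosine_sim_matrix` is my own, not from any library):

```python
import numpy as np

def cosine_sim_matrix(a, b):
    # Normalize each row to unit length; the dot product of unit vectors
    # is exactly their cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(2)
x = rng.normal(size=(5, 8))
print(cosine_sim_matrix(x, x)[0, 0])  # ~1.0: each vector vs. itself
```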
I even managed to get the result by doing this:
embed_mat = np.array([x for x in predictions_1[e_col]])
to = []
fro = []
sim = []
for i in match2:
    fro.append(i)
    embedding = pipe.predict(i).iloc[0][e_col]
    m = np.array([embedding,] * len(df))
    sim_mat = cosine_similarity(m, embed_mat)
    sim.append(max(sim_mat[0]))
    to.append(predictions_1['document'].values[sim_mat[0].argmax()])
pd.DataFrame({'From': fro, 'To': to, 'Similarity': sim})
But I think there is a better, or rather more optimized, way to solve it.
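One likely optimization: `predictions_2 = pipe.predict(df.two)` already contains an embedding per sentence of `match2`, so calling `pipe.predict(i)` again inside the loop repeats the expensive model inference. Stacking both embedding columns once and taking a row-wise `argmax` should give the same frame. A sketch with toy DataFrames standing in for the NLU outputs (the column name `emb` and the random values are placeholders; in the real output the embedding column is `e_col`):

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Stand-ins for predictions_1 / predictions_2: NLU returns one
# embedding list per row, so stacking the column yields a 2-D array.
rng = np.random.default_rng(3)
predictions_1 = pd.DataFrame({
    'document': ['sentence_to_1', 'sentence_to_2', 'sentence_to_3'],
    'emb': [list(rng.normal(size=8)) for _ in range(3)],
})
predictions_2 = pd.DataFrame({
    'document': ['sentence_from_1', 'sentence_from_2'],
    'emb': [list(rng.normal(size=8)) for _ in range(2)],
})

emb_to = np.vstack(predictions_1['emb'].to_list())
emb_from = np.vstack(predictions_2['emb'].to_list())
sim = cosine_similarity(emb_from, emb_to)  # all pairs in one call
best = sim.argmax(axis=1)                  # best match per "from" sentence

out = pd.DataFrame({
    'From': predictions_2['document'],
    'To': predictions_1['document'].values[best],
    'Similarity': sim.max(axis=1),
})
print(out)
```

With the real data, only the two stand-in DataFrames change; the stacking, the single `cosine_similarity` call, and the `argmax` stay the same, so the model runs exactly twice instead of once per sentence.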