Measuring the similarity between two documents using Doc2Vec

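I have already trained a gensim Doc2Vec model, and it can find the documents most similar to an unknown document.

Now I need a similarity value between two unknown documents (they are not in the training data, so they cannot be referenced by document id).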

from gensim.models import doc2vec
d2v_model = doc2vec.Doc2Vec.load(model_file)
string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'
vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())

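In the code above, vec1 and vec2 are successfully initialized to some values, each of size "vector_size".

Now, looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.

Can I compare the feature vectors value by value? If they are closer, does that mean the texts are more similar?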

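Hi, in case anyone is interested: all you need is the cosine distance between the two vectors.

I found that most people use scipy's spatial module for this purpose.

Here is a small code snippet that should work well if you have already trained a doc2vec model: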

from gensim.models import doc2vec
from scipy import spatial
d2v_model = doc2vec.Doc2Vec.load(model_file)
first_text = '..'
second_text = '..'
vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())
cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance indicates how much the two texts differ from each other:
# higher values mean more distant (i.e. different) texts
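
If you would rather have a similarity score than a distance, note that scipy's cosine distance is defined as 1 minus the cosine similarity, so the two are trivially interchangeable. Below is a minimal, dependency-free sketch of that relationship. The helper names cosine_similarity and cosine_distance are my own, and the toy vectors are made-up values standing in for the output of infer_vector, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors (hypothetical values) in place of vec1 / vec2 from infer_vector.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Similarity ranges from -1 (opposite direction) to 1 (same direction), so the distance ranges from 0 to 2. Only the direction of the vectors matters, not their length, which is why this tends to be preferred over comparing the vectors value by value.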
