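I am looking for a way to get something like Gensim's most_similar(), but using spaCy. I want to find the most similar sentences in a list of sentences using NLP.
I tried calling spaCy's similarity() (e.g. https://spacy.io/api/doc#simarility) on each pair in a loop, but it takes a very long time.
To go deeper:
I would like to put all of these sentences in a graph to find clusters of sentences.
Any ideas?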
Here is a simple built-in solution you could use:
import spacy
nlp = spacy.load("en_core_web_lg")
text = (
"Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
" These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
" The term semantic similarity is often confused with semantic relatedness."
" Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
" My favorite fruit is apples."
)
doc = nlp(text)
max_similarity = 0.0
most_similar = None, None
# Compare every unordered pair of sentences exactly once.
for i, sent in enumerate(doc.sents):
    for j, other in enumerate(doc.sents):
        if j <= i:
            continue  # skip self-comparisons and pairs already seen
        similarity = sent.similarity(other)
        if similarity > max_similarity:
            max_similarity = similarity
            most_similar = sent, other
print("Most similar sentences are:")
print(f"-> '{most_similar[0]}'")
print("and")
print(f"-> '{most_similar[1]}'")
print(f"with a similarity of {max_similarity}")
(Text from Wikipedia)
It will yield the following output:
Most similar sentences are:
-> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
and
-> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
with a similarity of 0.9583859443664551
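If the loop is too slow for a long list, one option is to compute all pairwise scores in a single matrix operation. The sketch below assumes that each sentence is represented by its averaged word vector (sent.vector in spaCy), which is also what similarity() uses by default. The helper names cosine_similarity_matrix, most_similar_pair and most_similar are made up for this example and are not part of spaCy, and toy vectors stand in for real sentence vectors so that it runs without a model.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Avoid dividing by zero for all-zero rows (e.g. text without vectors).
    norms[norms == 0.0] = 1.0
    unit = vectors / norms
    return unit @ unit.T


def most_similar_pair(vectors):
    """Hypothetical helper: indices and score of the closest pair of rows."""
    sims = cosine_similarity_matrix(vectors)
    np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
    i, j = np.unravel_index(np.argmax(sims), sims.shape)
    return int(i), int(j), float(sims[i, j])


def most_similar(vectors, index, topn=3):
    """Hypothetical helper: the rows closest to row `index`, best first."""
    sims = cosine_similarity_matrix(vectors)[index]
    ranked = [int(k) for k in np.argsort(sims)[::-1] if k != index]
    return [(k, float(sims[k])) for k in ranked[:topn]]


# With spaCy, the rows would come from the sentence vectors, e.g.
#   sentences = list(doc.sents)
#   vectors = np.array([sent.vector for sent in sentences])
# Toy vectors are used here instead (an assumption for illustration).
vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

i, j, score = most_similar_pair(vectors)
print(f"Most similar rows: {i} and {j} (cosine similarity {score:.4f})")
print(most_similar(vectors, 0, topn=2))
```

The indices map back to the sentences through the same list that produced the vectors, so sentences[i] and sentences[j] would be the most similar pair.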
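Please note the following information from spacy.io:
To make them compact and fast, spaCy's small pipeline packages (all packages that end in sm) don't ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens, but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:
- python -m spacy download en_core_web_sm
+ python -m spacy download en_core_web_lg
Also see Document similarity in Spacy vs Word2vec for advice on how to improve the similarity score.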
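For the part of the question about finding clusters in a graph, one simple approach (a rough sketch, not something spaCy provides) is to connect every pair of sentences whose similarity reaches a threshold and then read off the connected components. The function name similarity_clusters, the threshold of 0.8 and the toy similarity matrix below are all assumptions made for illustration. In practice the matrix would hold the pairwise similarity() scores, and the threshold would need tuning.

```python
import numpy as np


def similarity_clusters(sim_matrix, threshold=0.8):
    """Group items into the connected components of the graph whose
    edges link every pair with similarity >= threshold."""
    sims = np.asarray(sim_matrix)
    n = sims.shape[0]
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if other not in seen and sims[node, other] >= threshold:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(component))
    return clusters


# Toy symmetric similarity matrix for four sentences (made-up values).
sims = np.array([
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.30, 0.10],
    [0.20, 0.30, 1.00, 0.85],
    [0.10, 0.10, 0.85, 1.00],
])

print(similarity_clusters(sims, threshold=0.8))
```

A lower threshold merges more sentences into the same cluster, while a higher one splits them apart, so it is worth trying a few values on real data.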