How to do semantic string matching with gensim in Python

In Python, how can we determine whether a string is semantically related to a given phrase?

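Example:

Our phrase is:

'Fruit and Vegetables'

The strings we want to check for a semantic relationship are:

'I have an apple in my basket', 'I have a car in my house'

Result:

We know that the first item, 'I have an apple in my basket', is related to our phrase.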

You can use the gensim library to implement MatchSemantic, writing the code as a function like the one below (see the full code here):

Initialization

Install gensim and numpy:

pip install numpy
pip install gensim

Code

First, import the requirements:

from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity

Use this function to check whether strings and sentences match the phrase you are interested in:

def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']
    # Pad a single-document list with an empty document
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        # The square brackets must be escaped to match a literal [img_assist...] tag
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix.
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    # Score every document against the TF-idf weighted query
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)
    return index[query_tf]

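Note: the first time you run the code, a progress bar goes from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that everything is set up and you can simply run the code.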
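To make the score less of a black box, here is a minimal pure-Python sketch of the soft cosine measure that the function relies on conceptually. It is not gensim's implementation: the `soft_cosine` helper, the three-word vocabulary, and the 0.6 similarity between "fruit" and "apple" are all made up for illustration, whereas the real function derives term similarities from the GloVe vectors and weights terms with TF-idf.

```python
from math import sqrt


def soft_cosine(a, b, sim):
    """Soft cosine similarity between two term-weight vectors.

    a, b: lists of term weights over the same vocabulary.
    sim:  square term-similarity matrix with sim[i][i] == 1.
    """
    def bilinear(x, y):
        # Computes x^T * sim * y
        return sum(x[i] * sim[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    norm = sqrt(bilinear(a, a)) * sqrt(bilinear(b, b))
    return bilinear(a, b) / norm if norm else 0.0


# Toy vocabulary: ["fruit", "apple", "car"].
# The 0.6 below is an invented similarity, not a value taken from GloVe.
SIM = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

query = [1.0, 0.0, 0.0]      # contains only "fruit"
doc_apple = [0.0, 1.0, 0.0]  # contains only "apple"
doc_car = [0.0, 0.0, 1.0]    # contains only "car"

apple_score = soft_cosine(query, doc_apple, SIM)
car_score = soft_cosine(query, doc_car, SIM)
print(apple_score, car_score)
```

With an ordinary cosine similarity both documents would score 0 because neither shares a word with the query. The term-similarity matrix is what lets "apple" earn a non-zero score against "fruit".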

Usage

For example, we want to see whether 'Fruit and Vegetables' matches any of the sentences or items in documents.

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple on my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)

The first item, 'I have an apple on my basket', is semantically related to 'Fruit and Vegetables', so its score is 0.189. No relationship is found for the second item, so its score is 0.

Output:

0.189    # I have an apple in my basket
0.000    # I have a car in my house
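To turn these scores into a yes/no answer, one option is to compare each score with a cutoff. The sketch below is an assumption rather than part of the original answer: `related_documents` is a hypothetical helper, the threshold of 0.1 is an arbitrary value you would need to tune on your own data, and the scores are hard-coded from the output above so the example runs without downloading the model.

```python
def related_documents(documents, scores, threshold=0.1):
    """Return (document, score) pairs whose score is at least threshold."""
    return [(doc, float(score))
            for doc, score in zip(documents, scores)
            if score >= threshold]


documents = ['I have an apple on my basket', 'I have a car in my house']
# Hard-coded from the output shown above; in practice, pass the
# result of MatchSemantic(query_string, documents) here instead.
scores = [0.189, 0.0]

matches = related_documents(documents, scores)
print(matches)
```

Only the first sentence clears the cutoff here. A lower threshold admits weaker matches, and a higher one keeps only strongly related sentences.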
