How to do semantic string matching in Python with gensim



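In Python, how can we determine whether a string is semantically related to a given phrase?

Example:

Our phrase is:

'Fruit and Vegetables'

The strings we want to check for a semantic relationship are:

'I have an apple in my basket', 'I have a car in my house'

Result:

We know that the first item, 'I have an apple in my basket', is related to our phrase.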

You can use the gensim library to implement MatchSemantic, writing the code as a function like the one in this answer (see the full code here):

Initialization


  1. Install gensim and numpy
pip install numpy
pip install gensim

Code


  1. First, import the requirements
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
  2. Use this function to check whether the strings and sentences match the phrase you are looking for
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']
    # Pad a single-document list with an empty document
    # (note: this modifies the caller's list in place)
    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary, TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix.
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    # Score every document against the TF-idf weighted query
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)
    return index[query_tf]
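
To make the scoring step less of a black box, here is a minimal NumPy sketch of the soft cosine measure that SoftCosineSimilarity is built around. The helper name soft_cosine, the three-word vocabulary, and the similarity values in S are all made up for illustration; gensim works on sparse matrices derived from the GloVe vectors, so its numbers will differ.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine between bag-of-words vectors a and b.

    S[i, j] is the similarity between term i and term j.
    With S equal to the identity matrix this is the ordinary cosine.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = np.asarray(S, dtype=float)
    denom = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    if denom == 0.0:
        return 0.0  # an empty vector is related to nothing
    return float((a @ S @ b) / denom)

# Hypothetical vocabulary: ["fruit", "apple", "car"]
# The 0.6 between "fruit" and "apple" is an invented value.
S = np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = [1, 0, 0]      # contains only "fruit"
doc_apple = [0, 1, 0]  # contains only "apple"
doc_car = [0, 0, 1]    # contains only "car"

print(soft_cosine(query, doc_apple, S))  # nonzero although no word is shared
print(soft_cosine(query, doc_car, S))    # zero: "car" is unrelated in S
```

An ordinary cosine would score both documents at zero because neither shares a word with the query; the term similarity matrix is what lets "apple" count toward "fruit".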

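Note: the first time you run the code, a progress bar goes from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that everything is set up and you can simply run the code.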
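
The download happens only once, but MatchSemantic still calls api.load on every invocation, and opening the vectors takes time. One possible way around that, sketched here under assumptions, is to keep the loaded vectors in a module-level cache. The names get_vectors and _VECTOR_CACHE and the loader parameter are hypothetical and not part of gensim; only api.load is the real call.

```python
# Hypothetical cache: maps a model name to the already loaded vectors.
_VECTOR_CACHE = {}

def get_vectors(name="glove-wiki-gigaword-50", loader=None):
    """Return the vectors for `name`, loading them at most once per process.

    `loader` is an assumed hook for swapping in another loading function;
    by default gensim.downloader.load is used.
    """
    if name not in _VECTOR_CACHE:
        if loader is None:
            import gensim.downloader as api  # imported lazily, only when needed
            loader = api.load
        _VECTOR_CACHE[name] = loader(name)
    return _VECTOR_CACHE[name]
```

Inside MatchSemantic, the line glove = api.load("glove-wiki-gigaword-50") could then be replaced by glove = get_vectors(), so that repeated calls reuse the same object.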

Usage


For example, we want to see whether 'Fruit and Vegetables' matches any of the sentences or items in documents

Test:

query_string = 'Fruit and Vegetables'
documents = ['I have an apple on my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)

The first item, 'I have an apple on my basket', is semantically related to 'Fruit and Vegetables', so its score is 0.189. No relationship is found for the second item, so its score is 0.

Output:

0.189    # I have an apple in my basket
0.000    # I have a car in my house
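
The function returns one score per document, so deciding which strings count as related is left to the caller. A minimal sketch of that last step follows, assuming a cutoff of 0.1; the helper name label_matches and the threshold value are illustrative choices rather than anything gensim prescribes, and a suitable cutoff depends on your data.

```python
def label_matches(documents, scores, threshold=0.1):
    """Pair each document with its score and a match flag.

    The default threshold of 0.1 is an assumed value for illustration.
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return [
        (doc, float(score), float(score) >= threshold)
        for doc, score in zip(documents, scores)
    ]

documents = ['I have an apple on my basket', 'I have a car in my house']
scores = [0.189, 0.0]  # the values reported in the output above

for doc, score, matched in label_matches(documents, scores):
    label = "match" if matched else "no match"
    print(f"{score:.3f}  {label}  {doc}")
```

In real use, scores would be the array returned by MatchSemantic(query_string, documents) rather than a hard-coded list.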
