How to perform exact string matching in Python



I have a set of words:

words = {'thanks giving', 'cat', 'instead of', ...}

I need to search for exactly these words in the table column 'description':

--------------------------------|
ID  | Description               |
--- |---------------------------|
1   | having fun   thanks giving| 
----|---------------------------|
2   |  cat eats all the food    |
----|---------------------------|
3   |  instead you can come     | 
--------------------------------

def matched_words(x, words):
    match_words = []
    for word in words:
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))

Desired output:

----|---------------------------|-------------------|
ID  | Description               |matched words      |
--- |---------------------------|-------------------|
1   | having fun   thanks giving|['thanks giving']  |
----|---------------------------|------------------ |
2   |  cat eats all the food    |['cat']            |
----|---------------------------|-------------------|
3   |  instead you can come     | []                |
----------------------------------------------------

I only get matches such as ['cat'].

The following code should give you the desired result:

import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1, "having fun at thanksgiving"],
    [2, "cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]
matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
        # \b: word boundary before the word; \W: one non-word character after it
        result = re.search(r'\b' + word + r'\W', phrase[1])
        if result:
            # group(0) includes the trailing non-word character matched by \W
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])
            print()
print(matched_words)
print(matched_pairs)

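The relevant part, the regex re.search(r'\b'+word+'\W', phrase[1]), searches for cases where our search string starts at a word boundary (\b, which matches the empty string) and is followed by a non-word character (\W). This should ensure that we only find whole-word matches. Nothing else needs to be done to the text you are searching. Note that \W has to consume a character, so a word at the very end of the text is not matched; a trailing \b covers that case.

Of course, you can use whatever names you like in place of words, phrases, matched_words and matched_pairs.

Hope this helps!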
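To make the difference between the two boundary styles concrete, here is a small sketch comparing a trailing \W with a trailing \b. The sample text and the variable names are made up for illustration, and re.escape is an addition that is not in the answer's code; it only ensures that any regex metacharacters in a phrase are treated literally.

```python
import re

text = "having fun thanks giving"
phrase = "thanks giving"

# \W must consume one non-word character, so a match at the very end
# of the text fails because nothing follows the phrase.
trailing_nonword = re.search(r'\b' + re.escape(phrase) + r'\W', text)

# \b is zero-width, so it also succeeds at the end of the text.
word_boundary = re.search(r'\b' + re.escape(phrase) + r'\b', text)

# With \b on both sides, 'cat' is not found inside 'cater'.
inside_word = re.search(r'\bcat\b', "cater the food")

print(trailing_nonword)        # None
print(word_boundary.group(0))  # thanks giving
print(inside_word)             # None
```

If the phrases can appear at the end of a description, as in the first row of the question's table, the \b ... \b form is probably the safer choice.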

import re

words = {'thanks', 'cat', 'instead of'}
samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]
for id, description in samples:
    for word in words:
        # \b on both sides matches the word only as a whole word
        if re.search(r'\b' + word + r'\b', description):
            print("'%s' in '%s'" % (word, description))