Search a string for a word/phrase, including all possible approximations of that phrase



Suppose I have the following string:

string = 'machine learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'

Suppose further that I have a tag defined as:

tag = 'machine learning'

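Now I want to find the tag in my string. As you can see, my string contains machine learning in three places: once at the beginning of the string, once as machine12 learning, and once as machines learning. I want to find all of them and produce the output list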

['machine learning', 'machine12 learning', 'machines learning']

To do this, I tried tokenizing my tag with nltk, that is:

tag_token = nltk.word_tokenize(tag)

That gives me ['machine', 'learning']. I would then search for tag_token[0].

I know that string.find(tag_token[0]) and string.rfind(tag_token[0]) return the positions of the first and last machine found, but what if the text contains more occurrences of machine learning (here there are 3)?
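As a minimal sketch of how every occurrence, not only the first and last, could be located: re.finditer returns one match object per non-overlapping hit. The shortened text and the names text, token and positions below are made up for illustration.

```python
import re

# Shortened, hypothetical sample text (not the full string from the question)
text = 'machine learning ml is good, machine12 learning too, and machines learning as well'
token = 'machine'

# re.finditer yields one match object per non-overlapping occurrence;
# re.escape keeps the token literal even if it contains regex metacharacters
positions = [m.start() for m in re.finditer(re.escape(token), text)]
print(positions)
```

The first and last entries coincide with what text.find(token) and text.rfind(token) return; the entries in between are the occurrences those two calls miss.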

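In that case I cannot extract them all. So my initial idea, finding every occurrence of machine and then of learning, would fail. I was hoping to use fuzzywuzzy to score ['machine learning', 'machine12 learning', 'machines learning'] against the tag.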

So my question is: given the string I have, how can I search for the tag and its approximations and list them as follows?

['machine learning', 'machine12 learning', 'machines learning']
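One way the fuzzy idea could be sketched without extra dependencies is with difflib.SequenceMatcher from the standard library, used here in place of fuzzywuzzy. The helper fuzzy_find, the 0.85 threshold and the shortened sample text are all assumptions made for illustration: the sketch slides a window of as many words as the tag has across the text and keeps the windows that are similar enough to the tag.

```python
from difflib import SequenceMatcher

def fuzzy_find(text, tag, threshold=0.85):
    """Return the word windows of text whose similarity to tag is >= threshold.

    Hypothetical helper; the threshold is an arbitrary choice, not a tuned value.
    """
    words = text.split()
    n = len(tag.split())  # window size = number of words in the tag
    hits = []
    for i in range(len(words) - n + 1):
        candidate = ' '.join(words[i:i + n])
        ratio = SequenceMatcher(None, tag.lower(), candidate.lower()).ratio()
        if ratio >= threshold:
            hits.append(candidate)
    return hits

# Shortened, hypothetical sample text
sample = ('machine learning ml is a type of ai machine12 learning algorithms '
          'use data machines learning is good')
print(fuzzy_find(sample, 'machine learning'))
# ['machine learning', 'machine12 learning', 'machines learning']
```

A lower threshold admits looser variants at the cost of more false positives, so it would need adjusting for other tags.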
Update: I now know that I can do the following:
import re
pattern = re.compile(r"(machine[\s0-9]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
#[output]: ['machine learning', 'machine12 learning']

Likewise, if I do

pattern = re.compile(r"(machine[\sA-Za-z]+learning)", re.IGNORECASE)
matches = pattern.findall(string)
#[output]: ['machine learning', 'machines learning']

But as it stands, this is clearly not a generalizable solution. So I would like to know whether there is a smarter way to search in this case.

Maybe use a pattern like (string\w*)?

import re
string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good'
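tag_token = ['machine', 'learning']
# each token may carry a suffix (\w*) and is followed by whitespace plus at most one extra word;
# [:-14] trims that 14-character separator after the last token
pattern = '(' + ''.join(e + r'\w*\s+(?:\S*\s+)?' for e in tag_token)[:-14] + ')'
rgx = re.compile(pattern, re.IGNORECASE)
rgx.findall(string)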
#output
#['machine 12 learning', 'machine12 learning', 'machines learning']
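The [:-14] slice in the code above removes the separator that was appended after the last token. An equivalent construction, sketched here under the assumption that str.join is acceptable, places the separator only between tokens, so no slicing is needed. SEP and build_pattern are names invented for this sketch, and the sample text is shortened.

```python
import re

# Separator between consecutive tag words: whitespace, then optionally one extra word
SEP = r'\s+(?:\S*\s+)?'

def build_pattern(tokens):
    # Each token may be followed by extra word characters (machines, machine12);
    # re.escape guards against tokens that contain regex metacharacters
    return '(' + SEP.join(re.escape(t) + r'\w*' for t in tokens) + ')'

pattern = build_pattern(['machine', 'learning'])
print(pattern)

# Shortened, hypothetical sample text
text = 'machine 12 learning, machine12 learning and machines learning'
print(re.findall(pattern, text, re.IGNORECASE))
```

Printing the pattern makes it easier to see what the regex engine is actually asked to match.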

If the words of the tag can appear in a different order, finding matches becomes harder.

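And this code will find every ordering of the words in tag_token, for example machine s learning, machine learning, machine12 12 learning, learning machine, ... You can also create a new string and a new tag_token with more than 2 words. All orderings of those words will be found.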

For example, tag_token = ['1', '2', '3'] will find matches in 1 2 3, in 1a 2 b 3, and also in 2b2 1sss 3333 2tt 1

import re
import itertools
string = 'machine 12 learning ml is a type of artificial intelligence ai that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so machine12 learning algorithms use historical data as input to predict new output values machines learning is good. Learning machine can be used to train people. learning the machines is a great job'
tag_token = ['machine', 'learning']
pattern = '('
# add one alternative per ordering of the tag words
for current_tag in itertools.permutations(tag_token, len(tag_token)):
    pattern += ''.join(e + r'\w*\s+(?:\S*\s+)?' for e in current_tag)[:-14] + '|'
pattern = pattern.rstrip('|') + ')'
rgx = re.compile(pattern, re.IGNORECASE)
rgx.findall(string)
#output
#['machine 12 learning', 'machine12 learning', 'machines learning', 'Learning machine', 'learning the machines']
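For reuse, the permutation approach above could be wrapped in a function. The sketch below is one possible packaging, not the answer's own code: find_tag_variants is a hypothetical name, and re.escape is added as an assumption so that tokens with special characters stay literal. Note that the number of alternatives grows as the factorial of the number of tokens, so this only suits short tags.

```python
import itertools
import re

def find_tag_variants(text, tokens):
    """Find the tokens in any order, each optionally suffixed, with at most
    one extra word between neighbours (hypothetical helper)."""
    sep = r'\s+(?:\S*\s+)?'
    # One alternative per ordering of the tokens
    alternatives = (
        sep.join(re.escape(t) + r'\w*' for t in order)
        for order in itertools.permutations(tokens)
    )
    rgx = re.compile('(' + '|'.join(alternatives) + ')', re.IGNORECASE)
    return rgx.findall(text)

print(find_tag_variants('1 2 3', ['1', '2', '3']))
print(find_tag_variants('1a 2 b 3', ['1', '2', '3']))
print(find_tag_variants('Learning machine can be used', ['machine', 'learning']))
```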
