Finding arbitrarily long patterns of words using regular expressions in Python



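I am using Python 3.6 to find all occurrences of "as" words "as" in a text (that is, "as", then one or more words, then "as" again), with three words of context on either side.

For example, if I run the program on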
"The dog was as wildly energetic as the old one. It was as bright as it has ever been."

the ideal output would be

"The dog was as wildly energetic as the old one"
"one. It was as bright as it has ever"

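This should be an easy thing to do, but I can't figure it out. (I am new to programming.) At first I tried doing this on a tokenized version of the text, but then figured it might be easier to use a regex on the raw string.

The best I could come up with is: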

# Finding __ as __ as __ patterns
import re
raw = "The dog was as wildly energetic as the old one. It was as bright as it has ever been."
pattern_find = re.compile(r'\w* as \w* as \w*')    # Here we specify the regex.
results = pattern_find.findall(raw)    # Here we do the search and put the results in a list.
print(results)

This outputs

['was as bright as it']

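which completely misses the case where there are two words between the two occurrences of "as". This surprised me, because I thought that putting the asterisk * on \w would capture an arbitrarily long sequence of words. (What seems to be happening is that \w* captures an arbitrarily long string of consecutive characters, not words.)

My questions are:

  1. How can I get what I want using regex?
  2. Is there a better way to achieve the result I need?

Note: I know I can use NLTK's concordance() to find a single word with its context, but it does not let the user specify a pattern of words. An alternative to regex might involve building on the functionality of concordance().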

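Regex is the right tool for the job, although there are a few pitfalls. You have to craft a pattern that captures up to 3 words of context, but fewer if 3 words aren't available.

This regex should do the trick: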

(?:\S+\s+){,3}\b[aA]s(?:\s+\S+)+?\s+as\b(?:\s+\S+){,3}

Explanation:

(?:\S+\s+){,3}  # match a word, followed by space(s). Up to 3 times.
\b[aA]s         # assert word boundary and match "as" or "As"
(?:\s+\S+)+?    # match one or more words, but as few as possible
\s+             # followed by space(s)
as\b            # and another "as"
(?:\s+\S+){,3}  # match up to 3 more words
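As a quick check, here is a minimal sketch that runs this pattern (written out with its backslashes) against the sentence from the question. The variable names are only illustrative.

```python
import re

raw = ("The dog was as wildly energetic as the old one. "
       "It was as bright as it has ever been.")

# Up to 3 context words, "as"/"As", one or more words (lazily), "as",
# then up to 3 more context words. All groups are non-capturing, so
# findall() returns the full matched text.
as_pattern = re.compile(
    r'(?:\S+\s+){,3}\b[aA]s(?:\s+\S+)+?\s+as\b(?:\s+\S+){,3}'
)

matches = as_pattern.findall(raw)
for match in matches:
    print(match)
# The dog was as wildly energetic as the old one.
# It was as bright as it has ever
```

Two things to keep in mind: \S+ treats punctuation as part of a word, so "one." keeps its period, and matches never overlap, so "one." is used as right-hand context of the first match and is not available as left-hand context of the second.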

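\w is a single word character, not a whole word. \w* will indeed match a word (i.e. a run of consecutive word characters). However, you are better off using \w+, which matches one or more word characters rather than zero or more.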
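The difference between the two quantifiers can be seen in a small sketch using re.fullmatch, which only succeeds when the whole string fits the pattern. The variable names are illustrative.

```python
import re

# \w* accepts the empty string, because * means "zero or more".
star_matches_empty = re.fullmatch(r'\w*', '') is not None
# \w+ needs at least one word character, so the empty string is rejected.
plus_matches_empty = re.fullmatch(r'\w+', '') is not None
# \w+ covers a single word...
plus_matches_word = re.fullmatch(r'\w+', 'wildly') is not None
# ...but not two words, because the space between them is not a word character.
plus_matches_two_words = re.fullmatch(r'\w+', 'wildly energetic') is not None

print(star_matches_empty, plus_matches_empty,
      plus_matches_word, plus_matches_two_words)
# True False True False
```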

So you can try matching word by word:

\w+ \w+ \w+ as \w+ as \w+ \w+ \w+

Or, stating the number of repetitions explicitly:

(\w+ ){3}as \w+ as (\w+ ){3}

If you don't care how many words there are between the two "as", you can match any number of them (one or more):

(\w+ ){3}as (\w+ )+as (\w+ ){3}
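A rough sketch of how the pattern (\w+ ){3}as (\w+ )+as (\w+ ){3} behaves in Python. The sentence `plain` is a made-up, punctuation-free example (not from the question), and the variable names are illustrative. Because the groups here are capturing, findall() returns the groups rather than the whole match, so the sketch uses search() and group(0).

```python
import re

strict = re.compile(r'(\w+ ){3}as (\w+ )+as (\w+ ){3}')

raw = ("The dog was as wildly energetic as the old one. "
       "It was as bright as it has ever been.")
# Hypothetical sentence without punctuation, made up for this illustration.
plain = "the dog was as wildly energetic as the old one was"

found = strict.search(plain)
whole = found.group(0) if found else None
print(repr(whole))
# 'the dog was as wildly energetic as the old one '

# With capturing groups, findall() yields the last repetition of each group.
groups = strict.findall(plain)
print(groups)
# [('was ', 'energetic ', 'one ')]

# On the question's text there is no match: the pattern insists on exactly
# three "word + single space" tokens on each side, and "one." ends in a period.
print(strict.search(raw))
# None
```

So this form only fits text where three plain words, each followed by a single space, are available on both sides.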

A more advanced way of doing this would be:

(?:(?:\w+\s+)+as\s+){2}(?:\w+\s+)+
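For comparison, a minimal sketch running the pattern (?:(?:\w+\s+)+as\s+){2}(?:\w+\s+)+ on the sentence from the question. The variable names are illustrative.

```python
import re

raw = ("The dog was as wildly energetic as the old one. "
       "It was as bright as it has ever been.")

# Twice: one or more words followed by "as"; then one or more trailing words.
# Every word must be followed by whitespace, so each match ends with a space.
advanced = re.compile(r'(?:(?:\w+\s+)+as\s+){2}(?:\w+\s+)+')

results = advanced.findall(raw)
print(results)
# ['The dog was as wildly energetic as the old ', 'It was as bright as it has ever ']
```

Note that this version does not cap the context at three words: (?:\w+\s+)+ takes as many words as it can, and it stops at "one." and "been." only because \w does not match the period.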
