Word boundary to match strings that start/end with a dot (.)

I have a regular expression that matches words in a long text, like this:

import re

word = "word"
text = "word subword word"

def char_regex_ascii(word):
    # Wrap the escaped word in \b word-boundary assertions.
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())

Output:

word
word

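The reason for the \b is that I don't want to find substrings, only whole words: for example, I'm not interested in matching the word word inside the text subword. I only want complete words as results, that is, words preceded or followed by a space, comma, dot, or any other punctuation.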

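This works in most cases, but if I put a dot at the end of the word, as in w.o.r.d., it no longer matches, because the final \b in the regex comes after a dot. \b only matches between a word character and a non-word character, and the dot and the space that follows it are both non-word characters.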

import re

word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # The trailing \b cannot match between "." and " ".
    return r"\b{}\b".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())

Output:

(nothing)

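I know that using \B works, but then I would have to run several checks on the start and end of each search term and try every combination of \b and \B in order to search for many words.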

import re

word = "w.o.r.d."
text = "w.o.r.d. subword word"

def char_regex_ascii(word):
    # \B at the end matches between two non-word characters ("." and " ").
    return r"\b{}\B".format(re.escape(word))

r = re.compile(char_regex_ascii(word), flags=re.X | re.UNICODE)
for m in r.finditer(text):
    print(m.group())

Output:

w.o.r.d.

Is there a general approach?

You can use the regex pattern \w+(?:\.?\w+)* with re.findall:

import re

text = "w.o.r.d. subword word"
matches = re.findall(r'\w+(?:\.?\w+)*', text)
print(matches)  # ['w.o.r.d', 'subword', 'word']

The pattern used here defines a "word" as:

\w+         one or more word characters
(?:
    \.?\w+  followed by optional dot and one or more
            word characters
)*          zero or more times

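Under this definition, acronym-style terms such as w.o.r.d. are captured as a single match. Note that the final dot is not included (the match is w.o.r.d), because the pattern must end on a word character.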
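If the goal is still to search for a specific, known term (as in the question) rather than to extract every word, one possible alternative is to replace \b with lookarounds that only require "no word character on either side". The sketch below is an illustration, not part of the answer above; the helper name whole_word_regex is hypothetical.

```python
import re

def whole_word_regex(word):
    # Hypothetical helper: (?<!\w) and (?!\w) assert that no word character
    # sits directly before or after the term. Unlike \b, this does not depend
    # on whether the term itself starts or ends with a word character.
    return r"(?<!\w){}(?!\w)".format(re.escape(word))

text = "w.o.r.d. subword word"

for word in ("w.o.r.d.", "word"):
    pattern = re.compile(whole_word_regex(word), flags=re.UNICODE)
    print(word, [m.group() for m in pattern.finditer(text)])
```

With this pattern the term w.o.r.d. is found with its trailing dot, and word is found only as a standalone word, not inside subword. One caveat: any non-word character counts as a boundary, so the term would also match when it is surrounded by other punctuation.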
