How to use a regex to capture both single-letter and multi-letter words



The following regex pattern does almost everything I need it to, including capturing contractions:

re_pattern = "[a-zA-Z]+\'?[a-zA-Z]+"

However, if I run the following code:

sent = "I can't understand what I'm doing wrong or if I made a mistake."
re.findall(re_pattern, sent)

it does not pick up single-letter words such as I and a:

["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']

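Your pattern only matches words of at least 2 characters, because the second + also requires at least one match, with an optional ' in between. Changing it to the optional quantifier * fixes that: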

>>> re_pattern = "[a-zA-Z]+\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']

You need to use

re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

See the Python demo:

import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print( re.findall(re_pattern, sent) )
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
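
The two fixes shown so far are not quite equivalent. As a small sketch (the sample phrase is made up for illustration), the * version keeps a trailing apostrophe, while the optional-group version drops it, since the group only matches an apostrophe that is followed by letters:

```python
import re

# First fix: the second letter run is made optional with *
star_pattern = r"[a-zA-Z]+'?[a-zA-Z]*"
# Second fix: the apostrophe and the letters after it are optional together
group_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

phrase = "the dogs' bones"  # hypothetical sample input

star_words = re.findall(star_pattern, phrase)
group_words = re.findall(group_pattern, phrase)

print(star_words)   # ['the', "dogs'", 'bones']
print(group_words)  # ['the', 'dogs', 'bones']
```

Which behaviour is preferable depends on whether possessives such as dogs' should keep their apostrophe.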

Note: if you do not want to extract letter sequences that are glued to _ or digits, use word boundaries:

re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
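
To see what the word boundaries change, here is a minimal sketch on a made-up string. It assumes the boundary is written as \b inside a raw string; without the backslash, b would be a literal letter:

```python
import re

plain_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
bounded_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"

text = "foo_bar a1 can't I"  # hypothetical sample input

plain_words = re.findall(plain_pattern, text)
bounded_words = re.findall(bounded_pattern, text)

# Without boundaries, letter runs inside foo_bar and a1 are extracted
print(plain_words)    # ['foo', 'bar', 'a', "can't", 'I']
# With boundaries, letters touching _ or a digit are skipped
print(bounded_words)  # ["can't", 'I']
```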

If you plan to match any Unicode word:

re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
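
The class [^\W\d_] reads as "a word character that is neither a digit nor an underscore", which in Python 3 str patterns covers letters from any script. A short sketch with an invented sentence, compared against the ASCII-only pattern:

```python
import re

ascii_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
unicode_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"

# Hypothetical sample input; \u00eb is e-diaeresis, \u00e9 is e-acute
text = "Zo\u00eb can't find l'\u00e9cole"

ascii_words = re.findall(ascii_pattern, text)
unicode_words = re.findall(unicode_pattern, text)

# The ASCII class splits words at accented letters
print(ascii_words)    # ['Zo', "can't", 'find', 'l', 'cole']
# The Unicode-aware class keeps them whole
print(unicode_words)  # ['Zoë', "can't", 'find', "l'école"]
```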


And if you also want to match digits and underscores as part of "words", just use

re_pattern = r"\w+(?:'\w+)*"

The * after (?:'\w+) allows matching words like rock'n'roll.
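
As a quick illustration of why the group is repeated with * rather than ?, here is a sketch on a made-up phrase. With ? only one apostrophe part can be attached, so rock'n'roll is split:

```python
import re

repeat_pattern = r"\w+(?:'\w+)*"  # any number of 'part suffixes
single_pattern = r"\w+(?:'\w+)?"  # at most one 'part suffix

text = "rock'n'roll in 2024 with snake_case"  # hypothetical sample input

repeat_words = re.findall(repeat_pattern, text)
single_words = re.findall(single_pattern, text)

print(repeat_words)  # ["rock'n'roll", 'in', '2024', 'with', 'snake_case']
print(single_words)  # ["rock'n", 'roll', 'in', '2024', 'with', 'snake_case']
```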
