The regex pattern below does almost everything I need it to, including capturing contractions:
re_pattern = "[a-zA-Z]+\'?[a-zA-Z]+"
However, if I run the following code:
sent = "I can't understand what I'm doing wrong or if I made a mistake."
re.findall(re_pattern, sent)
it does not pick up one-letter words such as I or a:
["can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'made', 'mistake']
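Your pattern only matches words of at least two characters: the first + requires at least one letter, and the second + requires at least one more, with an optional ' in between. Changing the second quantifier to *, which also allows zero letters, fixes this: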
>>> re_pattern = "[a-zA-Z]+\'?[a-zA-Z]*"
>>> re.findall(re_pattern, sent)
['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
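To see exactly which words the trailing + was dropping, the two patterns can be run side by side. This is only a small sketch; the names `two_plus`, `star_tail` and `missed` are illustrative and not part of the original post.

```python
import re

sent = "I can't understand what I'm doing wrong or if I made a mistake."

# Original pattern: the trailing + demands a second run of letters,
# so every match is at least two letters long.
two_plus = r"[a-zA-Z]+'?[a-zA-Z]+"
# Relaxed pattern: the trailing * lets that second run be empty.
star_tail = r"[a-zA-Z]+'?[a-zA-Z]*"

found_before = re.findall(two_plus, sent)
found_after = re.findall(star_tail, sent)

# Words that only the relaxed pattern finds.
missed = [w for w in found_after if w not in found_before]
print(missed)  # ['I', 'I', 'a']
```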
You need to use
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
See the regex demo and the Python demo:
import re
re_pattern = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
sent = "I can't understand what I'm doing wrong or if I made a mistake."
print( re.findall(re_pattern, sent) )
# => ['I', "can't", 'understand', 'what', "I'm", 'doing', 'wrong', 'or', 'if', 'I', 'made', 'a', 'mistake']
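One practical difference between this optional group and the simpler `'?[a-zA-Z]*` form is how a trailing apostrophe is handled: the group only consumes an apostrophe when letters follow it. The sketch below illustrates this on a made-up sentence; the names `star_tail`, `grouped` and `text` are assumptions chosen for the example.

```python
import re

# Hypothetical test sentence with a trailing (possessive) apostrophe.
text = "the dogs' bowl isn't clean"

# Apostrophe and following letters are each optional on their own.
star_tail = r"[a-zA-Z]+'?[a-zA-Z]*"
# Apostrophe is matched only together with at least one following letter.
grouped = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"

print(re.findall(star_tail, text))
# ['the', "dogs'", 'bowl', "isn't", 'clean']
print(re.findall(grouped, text))
# ['the', 'dogs', 'bowl', "isn't", 'clean']
```

Which behavior is preferable depends on whether a dangling apostrophe should count as part of the word.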
Note: if you do not want to extract letter sequences that are glued to _ or to digits, use word boundaries:
re_pattern = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
See the regex demo. If you plan to match words made of any Unicode letters:
re_pattern = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"
See the regex demo.
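The effect of the word boundaries and of the Unicode letter class `[^\W\d_]` is easiest to see on one string that mixes accented letters, digits and underscores. The following is a rough sketch assuming Python 3, where `\w` and `\b` are Unicode-aware for `str` patterns; the sentence and the names `plain`, `ascii_bounded` and `unicode_bounded` are invented for the illustration.

```python
import re

# \u00eb is "e with diaeresis", \u00e9 is "e with acute".
text = "Zo\u00eb can't find abc_123 or foo2 at the caf\u00e9"

plain = r"[a-zA-Z]+(?:'[a-zA-Z]+)?"
ascii_bounded = r"\b[a-zA-Z]+(?:'[a-zA-Z]+)?\b"
unicode_bounded = r"\b[^\W\d_]+(?:'[^\W\d_]+)?\b"

# No boundaries: fragments of longer tokens are extracted.
print(re.findall(plain, text))
# ['Zo', "can't", 'find', 'abc', 'or', 'foo', 'at', 'the', 'caf']

# ASCII letters with boundaries: tokens touching _, digits,
# or non-ASCII letters are skipped entirely.
print(re.findall(ascii_bounded, text))
# ["can't", 'find', 'or', 'at', 'the']

# Any Unicode letter with boundaries: accented words are kept whole.
print(re.findall(unicode_bounded, text))
# ['Zo\u00eb', "can't", 'find', 'or', 'at', 'the', 'caf\u00e9']
```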
Ah, and if you also want to match digits and underscores as part of "words", just use
re_pattern = r"\w+(?:'\w+)*"
The * after (?:'\w+) lets the group repeat, so words with several apostrophes, such as rock'n'roll, are matched as a single word.
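As a quick check of the quantifier on the group, compare the `*` version with a `?` version on a word that contains two apostrophes. This is a minimal sketch; the sample string and the names `with_star` and `with_question` are assumptions made for the example.

```python
import re

# Hypothetical sample containing a word with two apostrophes.
text = "rock'n'roll isn't item_1"

# The apostrophe group may repeat any number of times.
with_star = r"\w+(?:'\w+)*"
# The apostrophe group may appear at most once.
with_question = r"\w+(?:'\w+)?"

print(re.findall(with_star, text))
# ["rock'n'roll", "isn't", 'item_1']
print(re.findall(with_question, text))
# ["rock'n", 'roll', "isn't", 'item_1']
```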