re.findall(r'[A-Za-z]+(?=\'|\.|-[A-Za-z]+)?', txt)
re.findall(r'[A-Za-z.-]+(?:\'[A-Za-z]+)?', txt)
Input
txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
Expected output
['which', 'would', 'find', "I'm", "U.S.", 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']
I tried the above and several variants, but couldn't get it to work. How can I do this?
You can use this regex with findall:
\w+(?:['.-]\w+\.?)?
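RegEx details:
- \w+ : Match 1+ word characters
- (?:['.-]\w+\.?)? : Optional non-capturing group that starts with ', . or -, followed by 1+ word characters and an optional trailing dot
Code: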
import re
txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
print(re.findall(r"\b\w+(?:['.-]\w+\.?)?", txt))
['which', 'would', 'find', "I'm", 'U.S.', 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']
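For readability, the pattern \b\w+(?:['.-]\w+\.?)? can also be written out with inline comments using re.VERBOSE. This is only a sketch of the breakdown given in this answer; the name WORD is an illustrative choice, not part of the original code.

```python
import re

# The pattern from this answer, spelled out with re.VERBOSE so each part
# carries its own comment. WORD is a hypothetical name chosen for this sketch.
WORD = re.compile(r"""
    \b            # word boundary
    \w+           # 1+ word characters
    (?:           # optional non-capturing group:
        ['.-]     #   an apostrophe, dot or hyphen
        \w+       #   followed by 1+ word characters
        \.?       #   and an optional trailing dot
    )?
""", re.VERBOSE)

txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
print(WORD.findall(txt))
```

Because the group is non-capturing, findall returns the full matches rather than group contents.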
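At the risk of overthinking the actual problem, I made the following assumptions:
- You only want to match letters, [A-Za-z].
- In a case like "let's play co-op." you don't want to match the trailing dot.
- Lastly, I assume you would also want to capture doubly hyphenated words like "non-English-speaking", and abbreviations with more than just one dot.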
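To see why these assumptions matter, here is a minimal sketch that runs a \w-based pattern with a single optional group over those cases. The pattern serves purely as a baseline for comparison; the name BASELINE and the sample string "rev_2" are choices made for this sketch.

```python
import re

# Baseline for comparison only: built on \w, with one optional separator group.
# BASELINE and the sample strings are hypothetical choices for this sketch.
BASELINE = re.compile(r"\b\w+(?:['.-]\w+\.?)?")

# \w also accepts digits and underscores, not just letters.
print(BASELINE.findall("rev_2"))                 # ['rev_2']

# The optional trailing dot is consumed after any separator, not only after a dot.
print(BASELINE.findall("let's play co-op."))     # ["let's", 'play', 'co-op.']

# Only one separator is allowed, so a doubly hyphenated word is split in two.
print(BASELINE.findall("non-English-speaking"))  # ['non-English', 'speaking']
```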
So, what I came up with is:
\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*
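- \b - A word boundary.
- [a-z]+ - 1+ alphabetic characters.
- (?: - Open the 1st non-capturing group:
  - (?: - Open the 2nd non-capturing group:
    - (\.)|['-] - Either a 1st capture group holding a dot, or a hyphen or apostrophe.
  - ) - Close the 2nd non-capturing group.
  - [a-z]+\1? - Match 1+ alphabetic characters and optionally match whatever was captured in the 1st capture group (hence a dot).
- )* - Close the 1st non-capturing group and match it 0+ times.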
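The role of the backreference \1 is easiest to see by inspecting group 1 on a few single tokens. The following is a small sketch, assuming the case-insensitive flag re.I; the sample tokens and the names pattern and results are picked for illustration only.

```python
import re

# Same pattern as in this answer, compiled case-insensitively.
pattern = re.compile(r"\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*", re.I)

# Hypothetical sample tokens: group 1 is only set when the separator was a dot,
# so the trailing dot is only consumed for dotted abbreviations.
results = {}
for sample in ["U.S.", "e.g.", "co-op.", "I'm."]:
    m = pattern.match(sample)
    results[sample] = (m.group(0), m.group(1))
    print(sample, "->", m.group(0), "| group 1:", m.group(1))
```

For "co-op." and "I'm." the separator is a hyphen or an apostrophe, so group 1 never takes part in the match, \1? matches nothing, and the final dot is left out.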
In Python, this could look like:
import re
txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
lst = [m.group(0) for m in re.finditer(r"\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*", txt, re.I)]
print(lst) # ['which', 'would', 'find', "I'm", 'U.S.', 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']
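A note on why the snippet goes through re.finditer and m.group(0) instead of calling re.findall directly: when a pattern contains exactly one capture group, findall returns that group's content rather than the whole match. The sketch below illustrates the difference and also checks the two cases from the assumptions; the helper words and the names PATTERN and sample are hypothetical.

```python
import re

PATTERN = r"\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*"

def words(text):
    """Return the full text of every match (group 0), ignoring capture group 1."""
    return [m.group(0) for m in re.finditer(PATTERN, text, re.I)]

sample = "I'm U.S. co-op"

# With one capture group, findall yields the group's content, using an empty
# string where the group did not take part in the match.
print(re.findall(PATTERN, sample, re.I))   # ['', '.', '']
print(words(sample))                       # ["I'm", 'U.S.', 'co-op']

# The two cases from the assumptions:
print(words("let's play co-op."))          # ["let's", 'play', 'co-op']
print(words("non-English-speaking"))       # ['non-English-speaking']
```

The capture group cannot simply be made non-capturing here, since the backreference \1 depends on it.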