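Regex to find one or more characters, including those with a period, apostrophe, or hyphen in between, but without the final symbol if it appears only once at the end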

re.findall(r'[A-Za-z]+(?=\'|\.|-[A-Za-z]+)?', txt)
re.findall(r'[A-Za-z.-]+(?:\'[A-Za-z]+)?', txt)

Input

txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "

Expected output

['which', 'would', 'find', "I'm", "U.S.", 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']

I tried the above and several variations but couldn't get it to work. How can I do this?

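You can use this regex for matching with findall:

\w+(?:['.-]\w+\.?)?

RegEx Demo

RegEx details:

\w+: match 1+ word characters
(?:['.-]\w+\.?)?: optional non-capturing group that starts with ', . or -, followed by 1+ word characters and an optional trailing dot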
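The breakdown above can also be written directly into the pattern with re.VERBOSE. This is only a sketch of the same regex; the name WORD and the shortened sample string are made up for illustration. Note that \w also matches digits and underscores, not just letters.

```python
import re

# Hypothetical name; the pattern is the one described above, spread out
# with re.VERBOSE so each piece can carry its own comment.
WORD = re.compile(r"""
    \w+            # one or more word characters
    (?:            # optional non-capturing group:
        ['.-]      #   an apostrophe, dot, or hyphen
        \w+        #   followed by one or more word characters
        \.?        #   and an optional trailing dot
    )?
""", re.VERBOSE)

print(WORD.findall("I'm U.S. co-op, ending. rd-"))
# ["I'm", 'U.S.', 'co-op', 'ending', 'rd']
```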

Code:

import re
txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
print(re.findall(r"\b\w+(?:['.-]\w+\.?)?", txt))
['which', 'would', 'find', "I'm", 'U.S.', 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']

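At the risk of overthinking the actual problem, I worked from the following assumptions:

You only want to use the letters [A-Za-z]
In a case like "let's play co-op." you do not want to match the trailing dot
Finally, I assumed you would also want to capture double-hyphenated words like "non-English-speaking" and abbreviations with more than just a single dot

So, what I came up with is:

\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*

See the online demo.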

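\b - a word boundary
[a-z]+ - 1+ alphabetic characters
(?: - open the first non-capturing group:
- (?: - open the second non-capturing group:
  - (\.)|['-] - the first capture group, holding a dot, or else an uncaptured apostrophe or hyphen
  - )[a-z]+\1? - close the second non-capturing group, match 1+ alphabetic characters, and optionally match whatever was captured in the first capture group (hence a dot)
- )* - close the first non-capturing group and match it 0+ times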
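For readers who prefer the explanation inline, here is a sketch of the same pattern laid out with re.VERBOSE. The name WORD_LETTERS and the sample string are illustrative only, and re.IGNORECASE is assumed so that [a-z] also covers uppercase letters.

```python
import re

# Hypothetical name; same pattern as above, annotated piece by piece.
WORD_LETTERS = re.compile(r"""
    \b             # word boundary
    [a-z]+         # one or more letters
    (?:            # first (outer) non-capturing group
        (?:        #   second (inner) non-capturing group: the separator
            (\.)   #     capture group 1: a dot
            | ['-] #     or an apostrophe or hyphen, not captured
        )
        [a-z]+     #   one or more letters
        \1?        #   optionally whatever group 1 captured (a dot)
    )*             # repeat the outer group zero or more times
""", re.VERBOSE | re.IGNORECASE)

# finditer + group(0) is used because findall would return group 1 instead.
print([m.group(0) for m in WORD_LETTERS.finditer("I'm U.S. co-op.")])
# ["I'm", 'U.S.', 'co-op']
```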

In Python this might look like:

import re
txt = "which would find I'm U.S. co-op, include ending. without the . , but not ' - . rd- "
lst = [m.group(0) for m in re.finditer(r"\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*", txt, re.I)]
print(lst) # ['which', 'would', 'find', "I'm", 'U.S.', 'co-op', 'include', 'ending', 'without', 'the', 'but', 'not', 'rd']
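To see where the two patterns in this thread diverge, here is a small comparison on the two cases named in the assumptions. The names FIRST, SECOND, and words are made up for this sketch.

```python
import re

# The two patterns from this thread; the names are hypothetical.
FIRST = re.compile(r"\w+(?:['.-]\w+\.?)?")
SECOND = re.compile(r"\b[a-z]+(?:(?:(\.)|['-])[a-z]+\1?)*", re.I)

def words(pattern, text):
    """Return whole matches; findall would return group 1 for SECOND."""
    return [m.group(0) for m in pattern.finditer(text)]

print(words(FIRST, "let's play co-op."))      # ["let's", 'play', 'co-op.']
print(words(SECOND, "let's play co-op."))     # ["let's", 'play', 'co-op']
print(words(FIRST, "non-English-speaking"))   # ['non-English', 'speaking']
print(words(SECOND, "non-English-speaking"))  # ['non-English-speaking']
```

Both patterns produce the expected list for the sample txt in the question; which one fits better depends on whether inputs like these two can occur.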
