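I am trying to match words of more than one letter that are either all uppercase, or have a lowercase first letter followed by uppercase letters, or contain a hyphen in the middle only when every letter is uppercase. This is my code: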
import re
s = "ASCII, aSCII, AS-CII, AS-cii"
myset = set(re.findall(r"\b[a-z]?[A-Z]+-?[A-Z]{1,}", s))
Out[555]: {'AS', 'AS-CII', 'ASCII', 'aSCII'}
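As you can see, "AS" should not be returned, because in the input it is followed by a hyphen and lowercase letters ("AS-cii"). How can I fix this?
I tried the following, but it raises an error: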
myset = set(re.findall(r"\b[a-z]?[A-Z]+-?[A-Z]+{1,}",s))
File "<ipython-input-545-7bdc0c902553>"
myset = set(re.findall(r"\b[a-z]?[A-Z]+-?[A-Z]+{1,}",s))
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/re.py", line 301, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_compile.py", line 562, in compile
p = sre_parse.parse(p, flags)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 855, in parse
p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 416, in _parse_sub
not nested and not items))
File "/home/c1962135/.local/share/virtualenvs/c1962135-9R_1M4TP/lib/python3.6/sre_parse.py", line 619, in _parse
source.tell() - here + len(this))
error: multiple repeat
You could use a conditional expression:
(...)?(?(1)if true then this|else this)
In your case, this could be
\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b
See a demo on regex101.com.
Broken down, this reads
\b                  # a word boundary
([a-z])?            # match a lower case letter if it is there
(?(1)               # if the lower case letter is there, match this branch
    [A-Z]+
    |
    [-A-Z]+[A-Z]    # else this one
)
(?!-)\b             # do not break at a -, followed by another boundary
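One practical note when trying the conditional pattern in Python: it contains a capturing group, so `re.findall` would return only the contents of group 1 rather than the whole match. Here is a minimal sketch, assuming the sample string `s` from the question, that uses `re.finditer` and `group(0)` instead. The variable names are illustrative only.

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

# Conditional pattern from above: group 1 is the optional lowercase letter.
pattern = re.compile(r"\b([a-z])?(?(1)[A-Z]+|[-A-Z]+[A-Z])(?!-)\b")

# findall() would return only group 1 ('' or 'a'), so collect the full matches.
matches = [m.group(0) for m in pattern.finditer(s)]
print(matches)
```

For the sample string this should list ASCII, aSCII and AS-CII and leave out AS.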
Here is one way:
res = [x[0] for x in re.findall(r"(([a-z]{1}[A-Z]+)|([A-Z]+-[A-Z]+))",s)]
print(res)
print(set(res))
which gives
['aSCII', 'AS-CII']
Let me know if this works for you. I split the pattern into two alternatives joined with | to add OR logic.
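The following regex matches all the criteria mentioned:
\b[a-z]*[A-Z]+[-A-Z]+[A-Z]+\b
Please check it here: https://regex101.com/r/JNC4kN/1/
However, this will fail if you give it an example such as aTHTHTH (a lowercase letter followed by a hyphen and uppercase letters). If you only want UPPER-UPPER for the hyphenated form, then use this regex:
\b[a-z]{0,1}(?<!-)[A-Z]+\b(?!-)|\b[A-Z]+-[A-Z]+\b
Check it here.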
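As a quick check of the second pattern against the string from the question, here is a small sketch. The pattern has no capturing groups, so `re.findall` returns the full text of each match. The variable names `pattern` and `result` are only for illustration.

```python
import re

s = "ASCII, aSCII, AS-CII, AS-cii"

# Second pattern from above: either an optional lowercase letter plus
# uppercase letters not adjacent to a hyphen, or UPPER-UPPER.
pattern = r"\b[a-z]{0,1}(?<!-)[A-Z]+\b(?!-)|\b[A-Z]+-[A-Z]+\b"

result = set(re.findall(pattern, s))
print(result)
```

On this input the "AS" in "AS-cii" should no longer appear in the set.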
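You could use the following regular expression, which covers edge cases concerning words preceded or followed by a hyphen (as shown at the link below):
(?<!\w|(?<=\w)-)(?:[a-zA-Z][A-Z]+|[A-Z]{2,}|[A-Z]+-[A-Z]+)(?!\w|-(?=\w))
Demo
Python's regex engine performs the following operations.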
(?<!                 # begin a negative lookbehind
  \w                 # match word char
  |                  # or
  (?<=\w)            # match a word char in a positive lookbehind
  -                  # match '-'
)                    # end negative lookbehind
(?:                  # begin non-cap grp
  [a-zA-Z][A-Z]+     # match a letter (either case) then 1+ uc letters
  |                  # or
  [A-Z]{2,}          # match 2+ uc letters
  |                  # or
  [A-Z]+-[A-Z]+      # match 1+ uc letters, '-', then 1+ uc letters
)                    # end non-cap grp
(?!                  # begin negative lookahead
  \w                 # match word char
  |                  # or
  -                  # match '-'
  (?=\w)             # match a word char in a positive lookahead
)                    # end negative lookahead
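The breakdown above can be kept in the code itself by compiling the pattern with `re.VERBOSE`, which makes the engine ignore whitespace and `#` comments inside the pattern. The following is a sketch under that assumption. The names `ACRONYM` and `find_acronyms` are hypothetical and do not come from the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE.
ACRONYM = re.compile(
    r"""
    (?<! \w | (?<=\w)- )      # not preceded by a word char, or by word char + '-'
    (?:
        [a-zA-Z][A-Z]+        # any letter, then 1+ uppercase letters
      | [A-Z]{2,}             # 2+ uppercase letters
      | [A-Z]+-[A-Z]+         # 1+ uppercase, '-', 1+ uppercase
    )
    (?! \w | -(?=\w) )        # not followed by a word char, or by '-' + word char
    """,
    re.VERBOSE,
)


def find_acronyms(text):
    """Return every full match; the pattern has no capturing groups."""
    return ACRONYM.findall(text)


print(find_acronyms("ASCII, aSCII, AS-CII, AS-cii"))
print(find_acronyms("AB-cd ef-GH IJ-KL"))
```

The second call exercises the hyphen edge cases: an uppercase run followed by "-cd" or preceded by "ef-" should be rejected, while "IJ-KL" is kept.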