How to write a regex pattern for multiple identical characters in Python 3



I have a long string of the following form:

joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."

It is a concatenation of random strings, interspersed with runs of consecutive F letters:

ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI

The number of consecutive F letters in each run is not fixed. Assume that five F letters never appear consecutively inside a random string.

I only want to extract the random strings, to get the following list:

random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

I assume a simple regular expression can solve this task:

random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')

Question: how do I write a regex pattern for multiple identical characters?

I would use re.split for this task, as follows:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
# Split wherever five or more consecutive F characters occur.
parts = re.split(r'F{5,}', joined_string)
print(parts)

Output:

['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']

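F{5,} matches five or more consecutive F characters. Note that str.split does not accept regex patterns, so re.split is needed here.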
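If the joined string can start or end with a run of F characters, re.split returns empty strings at those positions. Here is a minimal sketch of a helper that filters them out. The name split_on_runs and its parameters char and min_run are hypothetical, chosen only for illustration:

```python
import re

def split_on_runs(text, char="F", min_run=5):
    """Split text on runs of at least min_run consecutive char characters.

    Hypothetical helper for illustration; empty pieces are dropped.
    """
    # re.escape keeps the helper safe if char is a regex metacharacter.
    pattern = f"{re.escape(char)}{{{min_run},}}"
    return [piece for piece in re.split(pattern, text) if piece]

# A string that both starts and ends with a run of F characters.
print(split_on_runs("FFFFFASOGHFFFFFFGFIOSGFFFFF"))
# ['ASOGH', 'GFIOSG']
```

Runs shorter than min_run are left untouched, so a random string that contains up to four consecutive F characters stays in one piece.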

You can split on F{5,} and wrap it in a capturing group, so that the delimiter text is also part of the result:

import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print(re.split(r'(F{5,})', s))

Output:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
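With exactly one capturing group, re.split alternates between the text pieces and the captured delimiters, so the two kinds of item can be separated by slicing. A small sketch, where the variable names random_strings and f_runs are only illustrative:

```python
import re

s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split(r'(F{5,})', s)

# Even indices hold the text between delimiters,
# odd indices hold the captured F runs.
random_strings = parts[::2]
f_runs = parts[1::2]

print(random_strings)
print([len(run) for run in f_runs])
```

This slicing relies on the pattern having a single capturing group. With more groups, each delimiter contributes more than one item to the list.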

I would use re.findall here:

import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)

This prints:

['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']

The regex pattern used here can be explained as follows:

F{2,}           match any run of 2 or more consecutive F's (tried first)
|               OR, that failing
(?:
    [A-EG-Z]    match any uppercase letter other than F
    |           OR
    F(?!F)      match a single F (not followed by another F)
)+              all of these, one or more times
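The pattern above treats any two consecutive F's as a run and also returns the runs themselves. As a sketch of a variant that follows the question's threshold of five and keeps only the random strings, the runs can be matched without a group while the other text is captured. This is one possible adaptation, not part of the original answer:

```python
import re

joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"

# First alternative: consume a whole run of 5 or more F's (not captured).
# Second alternative: capture characters one at a time, stopping as soon
# as the next five characters are all F (negative lookahead).
pattern = re.compile(r'F{5,}|((?:(?!F{5}).)+)')

# group(1) is None for matches of the first alternative, so skip those.
random_strings = [m.group(1) for m in pattern.finditer(joined_string) if m.group(1)]
print(random_strings)
# ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
```

For this particular task, re.split(r'F{5,}', joined_string) is shorter and gives the same list.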
