I have a long string of the following form:
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI..."
It is a concatenation of random strings, interspersed with runs of consecutive F letters:
ASOGH
FFFFFFFFFFFFFFFFFFF
GFIOSG
FFFFFFFF
URHDHREEK
FFFFFF
IIIEI
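The number of consecutive F letters is not fixed, and you can assume that five F letters never appear consecutively inside a random string.
I only want to extract the random strings, to get the following list: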
random_strings = ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
I assume a simple regular expression can solve this task:
random_strings = joined_string.split('WHAT_TO_TYPE_HERE?')
Question: how do I write a regular expression pattern that matches a run of the same character?
I would use re.split for this task, as follows:
import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.split('F{5,}',joined_string)
print(parts)
Output:
['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
F{5,} means five or more consecutive F characters.
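One edge case worth keeping in mind: if the string starts or ends with a run of F's, re.split returns empty strings at those positions. The sketch below uses a made-up input (not the one from the question) to show this, and filters the empty strings out afterwards.

```python
import re

# Hypothetical input that both starts and ends with a run of F's.
joined_string = "FFFFFFASOGHFFFFFFFGFIOSGFFFFFF"

# A separator touching either end of the string yields '' at that end.
raw_parts = re.split(r'F{5,}', joined_string)

# Keep only the non-empty pieces, i.e. the random strings.
random_strings = [part for part in raw_parts if part]

print(raw_parts)       # ['', 'ASOGH', 'GFIOSG', '']
print(random_strings)  # ['ASOGH', 'GFIOSG']
```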
You can use F{5,} with split and keep it in a capture group, so that the separator text is also part of the result:
import re
s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
print( re.split(r'(F{5,})', s) )
Output:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
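Because the pattern is a single capture group, the list returned by re.split alternates between text and separator. As a small sketch building on that, slicing can pull the two kinds of pieces apart again; the variable names here are just illustrative.

```python
import re

s = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"

# With one capture group, the result alternates: text, separator, text, ...
pieces = re.split(r'(F{5,})', s)

random_strings = pieces[::2]   # even indexes: the text between separators
f_runs = pieces[1::2]          # odd indexes: the captured runs of F

print(random_strings)  # ['ASOGH', 'GFIOSG', 'URHDHREEK', 'IIIEI']
print([len(run) for run in f_runs])

# Nothing is lost: joining all pieces reproduces the original string.
assert "".join(pieces) == s
```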
I would use the regex find-all approach (re.findall) here:
import re
joined_string = "ASOGHFFFFFFFFFFFFFFFFFFFGFIOSGFFFFFFFFURHDHREEKFFFFFFIIIEI"
parts = re.findall(r'F{2,}|(?:[A-EG-Z]|F(?!F))+', joined_string)
print(parts)
This prints:
['ASOGH', 'FFFFFFFFFFFFFFFFFFF', 'GFIOSG', 'FFFFFFFF', 'URHDHREEK', 'FFFFFF', 'IIIEI']
The regex pattern used here can be explained as follows:
F{2,}            match any group of 2 or more consecutive F's (first)
|                OR, that failing
(?:
    [A-EG-Z]     match any uppercase letter other than F
    |            OR
    F(?!F)       match a single F (not followed by an F)
)+               all of these, one or more times
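Note that this pattern treats any run of two or more F's as a separator, while the question only guarantees that five F's in a row never occur inside a random string. The following is a possible variant, not part of the answer above, that raises the threshold to five and lets the text branch absorb runs of one to four F's. The sample input is made up for illustration.

```python
import re

# Hypothetical input: the first random string contains "FF" itself.
sample = "ABFFCDFFFFFFGH"

# Pattern from the answer: a run of 2+ F's always counts as a separator.
two_or_more = r'F{2,}|(?:[A-EG-Z]|F(?!F))+'

# Assumed variant: only runs of 5+ F's are separators; shorter runs
# (1 to 4 F's, not followed by another F) stay part of the text.
five_or_more = r'F{5,}|(?:[A-EG-Z]|F{1,4}(?!F))+'

parts_two = re.findall(two_or_more, sample)
parts_five = re.findall(five_or_more, sample)

print(parts_two)   # ['AB', 'FF', 'CD', 'FFFFFF', 'GH']
print(parts_five)  # ['ABFFCD', 'FFFFFF', 'GH']

# Drop the separator runs to keep only the random strings.
random_strings = [p for p in parts_five if not re.fullmatch(r'F{5,}', p)]
print(random_strings)  # ['ABFFCD', 'GH']
```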