I want to capture the text between lines that begin with a single semicolon.
Sample input:
s = '''
;

the color blue

;

the color green

;

the color red

;
'''
This is the desired output:
['the color blue', 'the color green', 'the color red']
This attempted solution does not work:
import re
pat = r'^;(.*)^;'
r = re.findall(pat, s, re.S|re.M)
print(r)
This is the incorrect output it produces:
['\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n']
Treat the semicolon line as a delimiter.
(?sm)^;\s*\r?\n(.*?)\s*(?=^;\s*\r?\n)
https://regex101.com/r/4tKX0F/1
Explanation
(?sm)                    # Modifiers: dot-all, multi-line
^ ; \s* \r? \n           # Beginning delimiter
( .*? )                  # (1), Text
\s*                      # Whitespace trim
(?= ^ ; \s* \r? \n )     # End delimiter
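To try the pattern above from Python, a minimal sketch follows. It assumes the question's input has a blank line on each side of every entry (written here with explicit escapes); the names `pattern` and `blocks` are illustrative only.

```python
import re

# Sample input from the question, written with explicit escapes so the
# blank lines around each entry are unambiguous (an assumption).
s = "\n;\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n;\n"

# Each semicolon line opens a block. The lookahead checks for the next
# semicolon line without consuming it, so that line can open the next block.
pattern = re.compile(r"(?sm)^;\s*\r?\n(.*?)\s*(?=^;\s*\r?\n)")

blocks = pattern.findall(s)
print(blocks)
```

Because the flags are given inline as `(?sm)`, no `re.S | re.M` argument is needed in the `re.compile` call.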
A non-regex solution: I split on ; and remove the empty strings.
s = '''
;
the color blue
;
the color green
;
the color red
;
'''
f = s.split(';')
x = [a.strip('\n') for a in f]  # trim the newlines around each piece
print(x)  # prints ['', 'the color blue', 'the color green', 'the color red', '']
a = [elem for elem in x if len(elem)]  # drop the empty strings
print(a)  # prints ['the color blue', 'the color green', 'the color red']
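One caveat with splitting on every `;` is that a semicolon inside an entry would also split it. A line-based variant, sketched below, treats only lines that consist of a lone semicolon as delimiters. The function name `blocks_between_semicolon_lines` and the sample text are made up for illustration.

```python
def blocks_between_semicolon_lines(text):
    """Return the text found between lines that consist of a lone ';'."""
    blocks = []
    current = None  # stays None until the first delimiter line is seen
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == ";":
            # A delimiter line closes the block being collected, if any.
            if current:
                blocks.append("\n".join(current))
            current = []
        elif stripped and current is not None:
            # Non-blank line inside a block; blank lines are skipped.
            current.append(stripped)
    return blocks


# Hypothetical input with a semicolon inside one of the entries.
sample = "\n;\n\nthe color blue; navy\n\n;\n\nthe color green\n\n;\n"
result = blocks_between_semicolon_lines(sample)
print(result)
```

Text before the first delimiter line or after the last one is ignored, since it does not lie between two delimiters.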
You could use this as the pattern:
pat = r';\n\n([\w* *]*)'
r = re.findall(pat, s)
This should capture what you need.
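I know you didn't ask for this, but it's worth considering pyparsing as an alternative to re. In fact, pyparsing properly includes regular expressions. Notice how this simple parser copes with varying numbers of blank lines.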
>>> parsifal = open('temp.txt').read()
>>> print (parsifal)
;
the colour blue
;
the colour green
;
the colour red
;
the colour purple
;
the colour magenta
;
>>> import pyparsing as pp
>>> p = pp.OneOrMore(pp.Suppress(';\n')+pp.ZeroOrMore(pp.Suppress('\n'))+pp.CharsNotIn(';\n')+pp.ZeroOrMore(pp.Suppress('\n')))
>>> p.parseString(parsifal)
(['the colour blue', 'the colour green', 'the colour red', 'the colour purple', 'the colour magenta'], {})
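Overall, the parser matches OneOrMore sequences of: a semicolon followed by a newline, any further newlines, a run of characters other than semicolons and newlines, and any trailing newlines.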
You can use ;\s*(.*?)\s*(?=;)
Usage:
print( re.findall(r'(?s);\s*(.*?)\s*(?=;)', s) )
# output: ['the color blue', 'the color green', 'the color red']
Explanation:
(?s)   # dot-all modifier (. matches newlines)
;      # consume a semicolon
\s*    # skip whitespace
(.*?)  # capture the following text, as little as possible, such that...
\s*    # ... it is followed only by (optional) whitespace, and...
(?=;)  # ... a semicolon
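To see why the closing semicolon is matched with a lookahead rather than consumed, the sketch below compares the two forms. It assumes the question's input has a blank line on each side of every entry; the variable names are illustrative.

```python
import re

# Sample input from the question, with explicit escapes (an assumption
# about the exact blank-line layout).
s = "\n;\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n;\n"

# Lookahead: the closing semicolon is only checked, not consumed, so it
# is still available to open the next match.
with_lookahead = re.findall(r"(?s);\s*(.*?)\s*(?=;)", s)

# Consuming form: the closing semicolon is used up, so every second
# block loses its opening delimiter and is skipped.
consuming = re.findall(r"(?s);\s*(.*?)\s*;", s)

print(with_lookahead)  # ['the color blue', 'the color green', 'the color red']
print(consuming)       # ['the color blue', 'the color red']
```

Note that this pattern is not anchored to the start of a line, so a semicolon inside an entry would also act as a delimiter.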