Given text like this:
text = " \t hello there\n \t how are you?\n \t HHHH"
hello there
how are you?
HHHH
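Can I get the common prefix substring with a regex?
I tried
In [36]: re.findall(r"(?m)(?:(^[ \t]+).+[\n\r]+\1)", " \t hello there\n \t how are you?\n \t HHHH")
Out[36]: [' \t ']
but clearly the common prefix substring is ' \t'.
I want to use it for a dedent function, like the one in Python's textwrap module.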
I suggest
match = re.search(r'(?m)\A(.*).*(?:\n?^\1.*$)*\n?\Z', text)
See this demo.
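To make the suggestion concrete, here is a minimal sketch that runs the pattern on the sample text from the question and strips whatever group 1 captures. The helper name `strip_common_prefix_re` and the constant `COMMON_PREFIX` are hypothetical, made up for this illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer above: group 1 is the longest prefix that every
# line starts with (it is empty when the lines share nothing).
COMMON_PREFIX = re.compile(r'(?m)\A(.*).*(?:\n?^\1.*$)*\n?\Z')

def strip_common_prefix_re(text):
    """Hypothetical helper: drop the shared prefix from every line."""
    match = COMMON_PREFIX.search(text)
    if not match or not match.group(1):
        return text  # nothing in common, leave the text untouched
    prefix = match.group(1)
    return "\n".join(line[len(prefix):] for line in text.split("\n"))

text = " \t hello there\n \t how are you?\n \t HHHH"
print(repr(strip_common_prefix_re(text)))  # 'hello there\nhow are you?\nHHHH'
```

Note that group 1 is not limited to whitespace, so lines that happen to begin with the same letters would lose those letters too.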
Here is an expression that finds a common prefix in the text:
r'^(.+).*(\n\1.*)*$'
Example:
import re
text = (
    "No Red Leicester\n"
    "No Tilsit\n"
    "No Red Windsor"
)
# Group 1 shrinks by backtracking until every following line starts with it
m = re.match(r'^(.+).*(\n\1.*)*$', text)
if m:
    print('common prefix is', m.group(1))
else:
    print('no common prefix')
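Note that this expression involves a lot of backtracking, so use it wisely, especially on large inputs.
To find the longest common "space" prefix, just find every run of leading whitespace and take the smallest len: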
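import re
def dedent(text):
    # Shortest run of leading whitespace among the lines that have any
    prefix_len = min(map(len, re.findall(r'(?m)^\s+', text)))
    # Remove that many characters from the start of every line
    return re.sub(r'(?m)^.{%d}' % prefix_len, '', text)
text = (
    " No Red Leicester\n"
    " No Tilsit\n"
    "\t\t No Red Windsor"
)
print(dedent(text))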
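The dedent above compares only the lengths of the leading whitespace, so a line indented with tabs and a line indented with spaces are treated as sharing a prefix. If that matters, one possible variation (a sketch, not part of the answer; the name `dedent_common_ws` is made up here) is to reuse the backreference idea but restrict group 1 to spaces and tabs:

```python
import re

def dedent_common_ws(text):
    """Sketch: strip only the whitespace prefix that all lines literally share."""
    # Group 1 may contain only spaces and tabs, and the backreference \1
    # forces every following line to begin with exactly the same characters.
    m = re.match(r'([ \t]*).*(?:\n\1.*)*\n?\Z', text)
    prefix = m.group(1) if m else ''
    if not prefix:
        return text
    # Remove the shared prefix at the start of each line
    return re.sub(r'(?m)^' + re.escape(prefix), '', text)

print(repr(dedent_common_ws("    a\n  b\n   c")))  # '  a\nb\n c'
print(repr(dedent_common_ws("  a\n\t\tb")))        # '  a\n\t\tb' (nothing shared)
```

Unlike textwrap.dedent, this sketch does not skip blank lines: an empty line in the middle of the text makes the shared prefix empty.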
I'm not very good with Python, so this code may not look idiomatic for the language, but algorithmically it should be fine:
>>> import io
...
>>> def strip_common_prefix(text):
...     position = text.find('\n')
...     offset = position
...     match = text[: position + 1]
...     lines = [match]
...     while match and position != len(text):
...         next_line = text.find('\n', position + 1)
...         if next_line == -1: next_line = len(text)
...         line = text[position + 1 : next_line + 1]
...         position = next_line
...         lines.append(line)
...         i = 0
...         for a, b in zip(line, match):
...             if i > offset or a != b: break
...             i += 1
...         offset = i
...         match = line[: offset]
...     buf = io.StringIO()
...     for line in lines:
...         if not match: buf.write(line)
...         else: buf.write(line[offset :])
...     text = buf.getvalue()
...     buf.close()
...     return text
...
>>> strip_common_prefix(" \t hello there\n \t how are you?\n \t HHHH")
' hello there\n how are you?\nHHHH'
>>>
A regex would have a lot of overhead for this.
import os
# not just for paths...
text = " \t hello there\n \t how are you?\n \t HHHH"
li = text.split("\n")
common = os.path.commonprefix(li)
li = [i[len(common):] for i in li]
for i in li:
    print(i)
=>
hello there
how are you?
HHHH
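One caveat: os.path.commonprefix compares the strings character by character, so it will also swallow leading letters that the lines happen to share (for example the "h" of "hello" and "how" when those are the only two lines). For dedent-like behaviour, a possible sketch is to trim the result back to its leading whitespace. The function name `dedent_commonprefix` is made up for this illustration, and skipping blank lines is an added assumption modelled on what textwrap.dedent does.

```python
import os

def dedent_commonprefix(text):
    """Sketch: dedent using os.path.commonprefix, limited to whitespace."""
    lines = text.split("\n")
    # Ignore blank lines so an empty line does not force an empty prefix
    common = os.path.commonprefix([ln for ln in lines if ln.strip()])
    # Keep only the spaces and tabs at the front of the shared prefix
    indent = common[:len(common) - len(common.lstrip(" \t"))]
    return "\n".join(ln[len(indent):] for ln in lines)

print(repr(dedent_commonprefix("  hello\n  how")))  # 'hello\nhow'
```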