Get the common prefix substring with a regex



Given text like this:

text = "  \t  hello there\n  \t  how are you?\n  \t HHHH"
      hello there
      how are you?
     HHHH

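Can I get the common prefix substring with a regex?

I tried

In [36]: re.findall(r"(?m)(?:(^[ \t]+).+[\n\r]+\1)", "  \t  hello there\n  \t  how are you?\n  \t HHHH")
Out[36]: ['  \t  ']

but the actual common prefix substring is clearly '  \t '.
I want to use this for a dedent function like the one in Python's textwrap module.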

I suggest

match = re.search(r'(?m)\A(.*).*(?:\n?^\1.*$)*\n?\Z', text)

See this demo.
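
As a quick check, the suggested pattern can be run against the text from the question. Group 1 holds the shared prefix. The follow-up step that strips the prefix with `re.escape` and `re.sub` is my own assumption about how one might use the match, not part of the suggestion itself.

```python
import re

text = "  \t  hello there\n  \t  how are you?\n  \t HHHH"

# Suggested pattern: group 1 is a prefix that every line starts with.
match = re.search(r'(?m)\A(.*).*(?:\n?^\1.*$)*\n?\Z', text)
prefix = match.group(1) if match else ''

# Assumed follow-up step: remove that prefix from the start of each line.
dedented = re.sub(r'(?m)^' + re.escape(prefix), '', text)

print(repr(prefix))
print(dedented)
```

Because the prefix is escaped before being reused, tabs and spaces in it are matched literally.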

Here is an expression that finds a common prefix in the text:

r'^(.+).*(\n\1.*)*$'

Example:

import re
text = (
    "No Red Leicester\n"
    "No Tilsit\n"
    "No Red Windsor"
)
# Group 1 is the longest prefix that every following line also starts with.
m = re.match(r'^(.+).*(\n\1.*)*$', text)
if m:
    print('common prefix is', m.group(1))
else:
    print('no common prefix')

Note that this expression involves a lot of backtracking, so use it judiciously, especially on large inputs.
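
If the backtracking is a concern, one possible alternative is a plain column-by-column scan, which visits each character at most once. This is a minimal sketch of a different technique, not the regex approach above; the helper name `common_prefix` is hypothetical.

```python
def common_prefix(lines):
    """Return the longest prefix shared by all strings in `lines`."""
    prefix = []
    # zip(*lines) yields one tuple of characters per column and stops
    # at the end of the shortest line.
    for chars in zip(*lines):
        if len(set(chars)) != 1:
            break  # the lines disagree in this column
        prefix.append(chars[0])
    return ''.join(prefix)


text = (
    "No Red Leicester\n"
    "No Tilsit\n"
    "No Red Windsor"
)
print(repr(common_prefix(text.split('\n'))))
```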

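To find the length of the common "whitespace" prefix, just find the leading whitespace of every line and take the smallest len:

import re

def dedent(text):
    # Length of the shortest run of leading spaces/tabs over all lines
    # (0 if some line is not indented at all).
    prefix_len = min(map(len, re.findall(r'(?m)^[ \t]*', text)))
    # Drop that many characters from the start of every line.
    return re.sub(r'(?m)^.{%d}' % prefix_len, '', text)

text = (
    "     No Red Leicester\n"
    "    No Tilsit\n"
    "\t\t   No Red Windsor"
)
print(dedent(text))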
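
One thing worth keeping in mind with the length-based approach: it removes the same number of characters from every line, whatever those characters are. For comparison, here is a small sketch using the standard library's `textwrap.dedent` on the same sample. As far as I understand its behavior, it only removes whitespace that is literally shared by all lines, so with this mix of tabs and spaces it leaves the text unchanged.

```python
import textwrap

text = (
    "     No Red Leicester\n"
    "    No Tilsit\n"
    "\t\t   No Red Windsor"
)

# The third line starts with tabs, the others with spaces, so the lines
# share no literal leading whitespace and nothing is removed.
result = textwrap.dedent(text)
print(result == text)

# With spaces only, the shared two-space margin is removed.
print(repr(textwrap.dedent("  a\n    b")))
```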

I'm not very good at Python, so this code may not look idiomatic, but the algorithm should be sound:

>>> import io
>>> def strip_common_prefix(text):
...     # Start with the whole first line as the candidate prefix.
...     position = text.find('\n')
...     offset = position
...     match = text[:position + 1]
...     lines = [match]
...     while match and position != len(text):
...         next_line = text.find('\n', position + 1)
...         if next_line == -1: next_line = len(text)
...         line = text[position + 1:next_line + 1]
...         position = next_line
...         lines.append(line)
...         # Shrink the candidate to what this line shares with it.
...         i = 0
...         for a, b in zip(line, match):
...             if i > offset or a != b: break
...             i += 1
...         offset = i
...         match = line[:offset]
...     buf = io.StringIO()
...     for line in lines:
...         if not match: buf.write(line)
...         else: buf.write(line[offset:])
...     text = buf.getvalue()
...     buf.close()
...     return text
...
>>> strip_common_prefix("  \t  hello there\n  \t  how are you?\n  \t HHHH")
' hello there\n how are you?\nHHHH'

A regular expression will have a lot of overhead for this.

import os
# commonprefix compares strings character by character, so it is not just for paths...
text = "  \t  hello there\n  \t  how are you?\n  \t HHHH"
li = text.split("\n")
common = os.path.commonprefix(li)
li = [i[len(common):] for i in li]
for i in li:
    print(i)

=>

 hello there
 how are you?
HHHH
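
Since `os.path.commonprefix` returns any shared characters, not only whitespace, a dedent-style helper probably wants to trim the result to its leading spaces and tabs. The following is a rough sketch under that assumption; the function name `dedent_common_whitespace` is hypothetical, and blank lines are not treated specially (a blank line makes the common prefix empty).

```python
import os


def dedent_common_whitespace(text):
    """Strip the leading whitespace shared by every line of `text`."""
    lines = text.split("\n")
    common = os.path.commonprefix(lines)
    # Keep only the leading spaces/tabs of the shared prefix, so shared
    # words such as "No " are left in place.
    indent = common[:len(common) - len(common.lstrip(" \t"))]
    return "\n".join(line[len(indent):] for line in lines)


text = "  \t  hello there\n  \t  how are you?\n  \t HHHH"
print(dedent_common_whitespace(text))
```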
