Replace a newline with a space only when it is not followed by another newline (undoing hard wraps in text)



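I have a bunch of text files with hard line wraps (that is, a newline roughly every 80 characters). I want to undo the wrapping and join the lines back into paragraphs, while keeping the newlines that mark a new section or paragraph.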

In other words, I want to replace "\n" with " " if and only if the following character is not another "\n".

The Python code below does what I want, but it is not very efficient, and I would rather do this with a regex and/or sed.

s = open(filename, 'r').read()
p = s.split('\n\n')  # split into paragraphs
p = [x.replace('\n', ' ') for x in p]  # within each paragraph, replace \n with a space
s2 = '\n\n'.join(p)  # join paragraphs back together

For example:

Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.

This should become:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.

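Update

I tried the 5 Python approaches below and timed them on a 5 MB text file. I was surprised to see that all 3 regex methods are an order of magnitude slower than the plain Python split/replace/join approach.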

import re
from itertools import groupby

def m1(s):
    p = s.split('\n\n')  # split into paragraphs
    p = [x.replace('\n', ' ') for x in p]  # within each paragraph, replace \n with a space
    r = '\n\n'.join(p)  # join paragraphs back together
    return r

def m2(s):
    r = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
    return r

def m3(s):
    p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)
    r = re.sub(p, " ", s)
    return r

def m4(s):
    # Note: iterating over the string s yields characters, not lines
    # (the answer this is taken from iterates over a file object instead)
    r = "".join(["".join(v) if k else " ".join(map(str.strip, v))+"\n"  for k, v in groupby(s, str.isspace)])
    return r

def repl(m):
    return (' ' if len(m.group(1))==1 else m.group(1)) + m.group(2)

def m5(s):
    r = re.sub(r'(\n+)(.)', repl, s)
    return r

Results:

np.array( timeit.repeat('r=m1(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[4]: array([ 0.01343679,  0.0136183 ,  0.0153013 ,  0.0122381 ,  0.01205051])
np.array( timeit.repeat('r=m2(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[5]: array([ 0.10881839,  0.108728  ,  0.10904381,  0.10862441,  0.10867569])
np.array( timeit.repeat('r=m3(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[6]: array([ 0.1358021 ,  0.1352592 ,  0.13556101,  0.1357465 ,  0.1354876 ])
np.array( timeit.repeat('r=m4(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[7]: array([ 2.51403842,  2.37821078,  2.4169096 ,  2.56688828,  2.36240571])
np.array( timeit.repeat('r=m5(s)', 'from __main__ import *', repeat=5, number=N) )/N
Out[8]: array([ 0.16381941,  0.1616353 ,  0.1620033 ,  0.1617353 ,  0.1615443 ])
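
For anyone who wants to repeat the comparison, here is a minimal sketch of a timing harness. The synthetic text built by `make_text`, its size, the helper names, and the repeat counts are all assumptions made for illustration; this is not the 5 MB file measured above, so absolute numbers will differ.

```python
import re
import timeit

def make_text(paragraphs=200, lines_per_paragraph=5):
    """Build hard-wrapped sample text: paragraphs separated by one blank line."""
    line = "Lorem ipsum dolor sit amet, consectetur adipiscing"
    paragraph = "\n".join([line] * lines_per_paragraph)
    return "\n\n".join([paragraph] * paragraphs)

def split_join(s):
    # Same idea as m1: split into paragraphs, unwrap each one, join again.
    return "\n\n".join(p.replace("\n", " ") for p in s.split("\n\n"))

LONE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")

def lookaround(s):
    # Same idea as m2: replace a newline that has no newline on either side.
    return LONE_NEWLINE.sub(" ", s)

def best_time(func, s, repeat=3, number=5):
    """Best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(s))
    return min(timer.repeat(repeat=repeat, number=number)) / number

if __name__ == "__main__":
    text = make_text()
    # On text with exactly one blank line between paragraphs both versions agree.
    assert split_join(text) == lookaround(text)
    for func in (split_join, lookaround):
        print(func.__name__, best_time(func, text))
```

Taking the minimum over the repeats is a common way to reduce noise from other processes; checking that the candidates return the same result before timing them guards against comparing functions that do different work.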

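You can use awk, like this:

awk '{$1=$1}1' RS='' ORS='\n\n' OFS=' ' file

Explanation:

  • {$1=$1} looks as if it changes nothing. That is true of the field contents, but the assignment still makes awk rebuild the record using the new output field separator (see below).

  • 1 always evaluates to true. Since no action is specified for it, awk prints the whole current record.

  • RS='' sets the input record separator. The empty string is a special value: records are split on blank lines, and a newline always acts as a field separator.

  • ORS='\n\n' sets the output record separator to a blank line as well.

  • OFS=' ' sets the output field separator to a space.

Output: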

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
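
To make the paragraph-mode behaviour more concrete, here is a rough Python sketch of the same steps. The function name `unwrap_like_awk` is made up for this illustration, and it mirrors the awk command only approximately: like awk's default field splitting, it also collapses runs of spaces and tabs inside a line.

```python
import re

def unwrap_like_awk(text):
    # RS='': records are separated by one or more blank lines;
    # leading and trailing newlines are ignored.
    records = [r for r in re.split(r"\n{2,}", text.strip("\n")) if r]
    # Default field splitting plus OFS=' ': split on any whitespace
    # (newlines included) and re-join the fields with single spaces.
    rebuilt = [" ".join(record.split()) for record in records]
    # ORS='\n\n': every record, including the last, is followed by a blank line.
    return "".join(r + "\n\n" for r in rebuilt)

print(repr(unwrap_like_awk("a\nb\n\nc\nd\n")))  # 'a b\n\nc d\n\n'
```

As with the awk command, the result ends with an extra blank line after the last paragraph.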

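If you use re.sub(), you have to work with negative lookbehind and lookahead assertions. This will not be very efficient if your input is large.

Lookbehind: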

(?<!...)
Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

Lookahead:

(?!...)
Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if
it’s not followed by 'Asimov'. 

Here is an example:

>>> import re
>>> text = """Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus."""
>>> re.sub(r"(?<!\n)\n(?!\n)", " ", text)
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.\n\nMauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.\n\nMaecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.'
>>> print(_)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
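
One detail worth checking is what the pattern does with runs of three or more newlines, and how that compares with the split/replace/join version from the question. The short sketch below is only an illustration; the helper names are invented here.

```python
import re

LONE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")

def unwrap_regex(text):
    # Replace only newlines that have no newline directly before or after.
    return LONE_NEWLINE.sub(" ", text)

def unwrap_split(text):
    # Split on blank lines, unwrap each piece, and join again.
    return "\n\n".join(p.replace("\n", " ") for p in text.split("\n\n"))

two = "a\nb\n\nc\nd"
three = "a\n\n\nb"

print(repr(unwrap_regex(two)))    # 'a b\n\nc d'
print(repr(unwrap_split(two)))    # 'a b\n\nc d'
print(repr(unwrap_regex(three)))  # 'a\n\n\nb' (no lone newline, nothing changes)
print(repr(unwrap_split(three)))  # 'a\n\n b' (the third newline becomes a space)
```

The two versions agree when paragraphs are separated by exactly one blank line and differ on longer runs. Both also turn a single trailing newline at the very end of the text into a space, which may or may not be wanted.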

You can use groupby, grouping the lines on whether they are whitespace:

from itertools import groupby
with open("test.txt") as f:
    print("".join(["".join(v) if k else " ".join(map(str.strip, v))+"\n"  for k, v in groupby(f, str.isspace)]))

This gives you:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
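
If the files are large, the same grouping idea can be written as a generator, so that only one paragraph is held in memory at a time. This is a sketch under that assumption; `unwrap_lines` is a name chosen for this example, and the in-memory `sample` list stands in for a file object.

```python
from itertools import groupby

def unwrap_lines(lines):
    """Yield output chunks: blank-line runs unchanged, text runs joined by spaces."""
    for is_blank, group in groupby(lines, key=str.isspace):
        if is_blank:
            # Keep the blank lines that separate paragraphs exactly as they are.
            yield "".join(group)
        else:
            # Strip each line and join the paragraph into a single line.
            yield " ".join(line.strip() for line in group) + "\n"

sample = ["one\n", "two\n", "\n", "three\n", "four\n"]
print("".join(unwrap_lines(sample)), end="")
```

Because an open file iterates line by line, the generator can be fed a file object directly and its chunks passed to the `writelines()` method of an output file, without building the whole result as one string.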

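I would try the following regex in Python.

Assuming the variable text contains the sample text:

import re
p = re.compile(r'(?<!^)\n(?=\S)', re.MULTILINE)
result = re.sub(p, " ", text)
print(result)

It prints the following text, with each single newline replaced by a space.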

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus porta dui quis aliquet interdum. Sed in pellentesque libero. Quisque tempus nisl nec nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum. Nunc nec tristique magna, non sagittis lacus. Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et aliquet purus dignissim. Sed faucibus, lectus in auctor ornare, dolor libero ultrices sem, vel iaculis ex nulla quis lacus.
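
To see why this pattern leaves paragraph breaks alone, it can help to spell it out in verbose mode. The following is a small annotated sketch; the name `LINE_WRAP` and the sample string are made up for the illustration.

```python
import re

LINE_WRAP = re.compile(
    r"""
    (?<!^)   # not at the start of a line, so this newline does not end an empty line
    \n       # the newline itself
    (?=\S)   # the next character is not whitespace, so no blank line follows
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = "first\nsecond\n\nthird\nfourth"
print(repr(LINE_WRAP.sub(" ", sample)))  # 'first second\n\nthird fourth'
```

One consequence of the `(?=\S)` lookahead is that a newline followed by an indented line (one starting with a space or tab) is not joined either.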

See the demo on regex101.

Sometimes a complex substitution can be done by passing a function as the second argument to re.sub().

import re
ipsum = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Vivamus porta dui quis aliquet interdum. Sed
in pellentesque libero. Quisque tempus nisl nec
nisl condimentum ullamcorper.

Mauris vulputate nibh nec ipsum mattis rutrum.
Nunc nec tristique magna, non sagittis lacus.
Aliquam id urna lectus.

Maecenas volutpat libero quis erat mollis, et
aliquet purus dignissim. Sed faucibus, lectus in
auctor ornare, dolor libero ultrices sem, vel
iaculis ex nulla quis lacus.
'''
# A single newline followed by a character becomes a space; longer runs are kept as they are
ipsum = re.sub(
    r'(\n+)(?=.)',
    lambda m: ' ' if len(m.group(1))==1 else m.group(1),
    ipsum)
print(ipsum)
