Python: Algorithm for renumbering footnotes in text pages (files)



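Suppose you have text files, each containing the text of one page of a book. Suppose each page has 0 to 10 footnotes, and the footnotes are numbered 1 to N consecutively across all the pages of each chapter. Now suppose the last page of one chapter also overlaps with the first page of the next chapter.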

Footnotes are declared in the text of a page with the syntax: (1).

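It is the overlapping pages that are giving me trouble when renumbering the footnotes on each page. I want every page to have its footnotes numbered from 1 to N for that page.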
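As a minimal sketch of the marker syntax described above, the snippet below finds the footnote numbers on one line. The pattern is an assumption: it treats any run of digits in parentheses as a footnote, so it would also match ordinary parenthesized numbers such as a year. FOOTNOTE_RE is an illustrative name.

```python
import re

# Assumed pattern: a footnote marker is one or more digits in parentheses, e.g. (1) or (12).
FOOTNOTE_RE = re.compile(r"\((\d+)\)")

line = "A single line(1) of text might have multiple footnotes(2) in it on the same line."

# Collect the footnote numbers in the order they appear on the line.
numbers = [int(m.group(1)) for m in FOOTNOTE_RE.finditer(line)]
print(numbers)
```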

Below is an example of a special case that breaks every loop I have come up with:

Original page example:

A footnote from the last part of a chapter might begin with any number footnote(2).  
This might be in the last paragraph of some chapter that is ending.
Some Next Chapter DD
A single line(1) of text might have multiple footnotes(2) in it on the same line.
Then a new line of text has another footnote(3) in it.

I want to renumber the footnotes on the example page above to produce the example page below:

Desired page with renumbered footnotes:
-----Example page of text where the footnotes start-----

A footnote from the last part of a chapter might begin with any number footnote(1).  
This might be in the last paragraph of some chapter that is ending. 
Some Next Chapter DD
A single line(2) of text might have multiple footnotes(3) in it on the same line.
Then a new line of text has another footnote(4) in it.

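In Python, I have not found any loop algorithm that works. Whether the corrections are applied to the file immediately or buffered, the next pass of the loop may renumber the right footnote correctly, or it may clobber a footnote that was already corrected in a previous pass. Do I need file seek operations, or can some kind of regex loop handle this?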

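I now have a solution to this problem. It turns out that an inline change can sometimes leave two identical footnotes on the same line, where the second one is the next footnote to be changed. A plain regex replacement would then hit the first one, which was already changed. This case needs a little extra care.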
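To make the collision concrete, here is a small sketch using one line from the example page. The variable names (collided, naive) are illustrative only.

```python
line = "A single line(1) of text might have multiple footnotes(2) in it on the same line."

# Step 1: renumber the first marker (1) -> (2).
# The line now holds two identical "(2)" markers.
line = line.replace("(1)", "(2)", 1)
collided = line.count("(2)")

# Step 2: a naive "replace the first (2)" hits the marker fixed in step 1,
# not the original (2) that still needs renumbering.
naive = line.replace("(2)", "(3)", 1)

print(collided)
print(naive)
```

After step 2 the line reads "line(3) ... footnotes(2)", which is the wrong result: the marker that was already corrected has been changed again, and the one still waiting is untouched.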

For the code below, page is the list of text lines returned by file_handle.readlines():

import re
import sys

def replace_nth_substring_in_string(string, old, new, nth):
    """Replace only the nth (1-based) occurrence of the literal text old."""
    # re.escape is needed because old contains parentheses, e.g. '(2)'
    split_location = [m.start() for m in re.finditer(re.escape(old), string)][nth - 1]
    (head, tail) = (string[:split_location], string[split_location:])
    tail = tail.replace(old, new, 1)
    return head + tail

new_num = 1
for i in range(len(page)):
    # footnote markers such as (1) or (12), as they stand before this line is edited
    footnote_matches = re.findall(r'\(\d+\)', page[i])
    for match in footnote_matches:
        old = match
        new = '({})'.format(new_num)
        # grabbing this piece of info is key !
        num_old_foots_on_line = page[i].count(old)
        # normal case; simple replace
        if num_old_foots_on_line == 1:
            page[i] = page[i].replace(old, new, 1)
        # if a previous correction has now caused two identical footnotes
        # then replace the last one only ...
        elif num_old_foots_on_line == 2:
            page[i] = replace_nth_substring_in_string(page[i], old, new, 2)
        # for my case, there should never be three or more identical footnotes
        # on a line, but others may have to handle this case
        else:
            print("There are three (or more) footnotes on this line")
            sys.exit()
        new_num += 1
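As an alternative to the in-place approach above (not the method used in that code), a single left-to-right pass with re.sub and a replacement function avoids the collision entirely. Each marker is matched against the original text exactly once, so an already renumbered marker is never seen again. This is a sketch under the assumption that every run of digits in parentheses is a footnote marker. The function name renumber_footnotes and its start parameter are hypothetical.

```python
import re

# Assumed marker syntax: digits in parentheses, e.g. (1) or (12).
FOOTNOTE_RE = re.compile(r"\(\d+\)")

def renumber_footnotes(page, start=1):
    """Return a new list of lines with markers renumbered start, start+1, ..."""
    counter = start

    def next_marker(match):
        # Called once per marker, in left-to-right order within each line.
        nonlocal counter
        marker = "({})".format(counter)
        counter += 1
        return marker

    # Lines are processed in order, so numbering continues across lines.
    return [FOOTNOTE_RE.sub(next_marker, line) for line in page]

page = [
    "A footnote from the last part of a chapter might begin with any number footnote(2).\n",
    "This might be in the last paragraph of some chapter that is ending.\n",
    "Some Next Chapter DD\n",
    "A single line(1) of text might have multiple footnotes(2) in it on the same line.\n",
    "Then a new line of text has another footnote(3) in it.\n",
]

renumbered = renumber_footnotes(page)
print("".join(renumbered))
```

Because the substitution reads from the original line and writes to a new string, no file seek operations or per-line duplicate counting are needed. The trade-off is that any parenthesized number that is not a footnote would be renumbered too.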
