我如何找到一个大字符串的最佳拟合子序列



说我有一个大字符串和一系列子字符串,当加入大字符串时(差异很小)。

例如(注意字符串之间的细微差异):

large_str = "hello, this is a long string, that may be made up of multiple
 substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
 "subsrings tat aproimately ", "match the orginal strng"]

我如何最好地对齐字符串以从原始large_str产生新的子字符串?例如:

["hello, this is a long string", ", that may be made up of multiple",
 "substrings that approximately ", "match the original string"]

附加信息

用例是从PDF文档中提取的文本的现有页面中断原始文本的页面中断。与原始文本相比,从PDF中提取的文本是OCR的,但原始文本没有页面断开。目标是准确打破原始文本,以避免PDF文本的OCR错误。

  1. 连接子字符串
  2. 将串联与原始字符串对齐
  3. 跟踪原始字符串中哪些位置与子字符串之间的边界对齐
  4. 将原始字符串分开与这些边界对齐的位置

使用Python的Difflib的实现:

from difflib import SequenceMatcher
from itertools import accumulate
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
  "hello, ths is a lng strin",
  ", that ay be mad up of multiple",
  "subsrings tat aproimately ",
  "match the orginal strng"]
sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk = False)
match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
  if tag == 'delete' or tag == 'insert' or tag == 'replace':
    matches[match_index] += large_str[i1:i2]
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      j1 += submatch_len
  else:
    while j1 < j2:
      submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      while submatch_len == 0:
        match_index += 1
        submatch_len = min(sub_str_boundaries[match_index], j2) - j1
      matches[match_index] += large_str[i1:i1+submatch_len]
      j1 += submatch_len
      i1 += submatch_len
print(matches)

输出:

['hello, this is a long string', 
 ', that may be made up of multiple ', 
 'substrings that approximately ', 
 'match the original string']

您正在尝试解决序列对齐问题。就您而言,这是"本地"序列对齐。它可以通过史密斯 - 水手方法来解决。一个可能的实现在这里。如果运行它,您将收到:

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]
for sbs in sub_strs:
    water(large_str, sbs)

 >>>
Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s  in
hello, th-s is a l-ng s--in
Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that  ay be m d  up of mul  ple
, that -ay be m-d- up of mul--ple
Identity = 83.333 percent
Score = 225
substrings that approximately 
su s rings t at a pro imately 
su-s-rings t-at a-pro-imately 
Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma   ch the or  g nal str ng
manbvch the orhjg-nal str-ng

中线显示匹配的字符。如果您需要这些位置,请查找max_i值以获取原始字符串中的 ending 位置。启动位置将是water()函数末尾的i的值。

(附加信息使以下许多不必要。/p>

将有一个动态的编程解决方案,以解决与此问题非常接近的问题。在为您提供编辑距离的动态编程算法中,动态程序的状态为(a,b),其中a是第一个字符串中的偏移,而b是第二个字符串中的偏移。对于每对(a,b),您可以找到与第一个字符串的第一个字符与第二个字符串的第一个b字符匹配的最小的编辑距离,从(a-1,b)锻炼(a,b)-1),(a-1,b)和(a,b-1)。

您现在可以使用状态(a,n,m,b)编写类似的算法,其中a是迄今为止子字符串消耗的字符总数,n是当前子字符串的索引,m是在当前子字符串,B是第二个字符串中匹配的字符数。这解决了通过将B组粘贴在一起的任何可用子字符串的任何数量的副本组成的字符串组成的字符串的问题。

这是一个不同的问题,因为如果您试图从片段中重建一个长字符串,您可能会得到一个不止一次使用相同片段的解决方案,但是如果您这样做,您可能希望答案很明显足以使其产生的子字符串的集合恰好是给它收集的置换。

因为此方法返回的编辑距离始终将与最佳的编辑距离一样好,因此您也可以使用它来计算置换最佳编辑距离的下限,并且运行分支和绑定算法以找到最佳的排列。

最新更新