How to find the match for a string from one text file in another text file



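I have two text files. Both have the same content, but each is formatted differently. In one file there are extra spaces between words or letters. The line breaks also differ. For example: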

file1:

The annotation framework we presented is 
embedded in the Knowledge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the annotation of atomic ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS. 

file2:

Theannotationframework we presented is 
embedded in th e K n o w l e d ge Management and 
Acquisition Platform Semantic Turkey (Pazienza, et 
al., 2012), and comes out-the-
box with a few 
annotation families which differ in the underlying 
annotation model and, notably, in the tasks they 
support. The default handlers take into consideration 
the a n n o t a t i o n  o f a t o m i c ontological resources, and 
complex activities that are provided as macros, e.g. 
the creation of new instances, the definition of new 
subclasses in OWL, or of narrower concepts in 
SKOS.

Suppose I select the string "the Knowledge Management" from file1 and want to match it against the string "th e K n o w l e d ge Management" in file2.

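How can I achieve this? The malformation in the second file follows no fixed pattern. The only guarantee is that the characters appear in the same order in both files, possibly separated by extra spaces or with spaces missing.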

I considered using the Sellers algorithm or the Viterbi algorithm, but I'm not sure. Approximate string matching can also be expensive.

Any pointers would help. Thank you very much!

You should realize that you don't have two texts but actually a single one, with every character at the same position!

By what magic? It suffices to strip all whitespace and separators or, better, to skip them as you move from one character to the next.

You can then easily traverse the two texts in parallel, keeping them in sync, and no search is needed.

For example, "the Knowledge Management" corresponds to "th e K n o w l e d ge Management", from position 45 to 67 (counting non-space characters only).
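The position-based lookup can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the helper name `raw_span` is hypothetical, whitespace is taken to mean anything `str.isspace()` accepts, the range is treated as half-open with `start < end`, and the two sample strings are the first lines of file1 and file2 joined with single spaces.

```python
def raw_span(text, start, end):
    """Map the half-open range [start, end) of non-whitespace positions
    to a (raw_start, raw_end) slice of `text`. Assumes start < end."""
    count = 0          # non-whitespace characters seen so far
    raw_start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            continue   # whitespace never advances the logical position
        if count == start:
            raw_start = i
        count += 1
        if count == end:
            return raw_start, i + 1
    raise ValueError("range lies beyond the end of the text")

file1 = ("The annotation framework we presented is "
         "embedded in the Knowledge Management and")
file2 = ("Theannotationframework we presented is "
         "embedded in th e K n o w l e d ge Management and")

s1, e1 = raw_span(file1, 45, 67)
s2, e2 = raw_span(file2, 45, 67)
print(repr(file1[s1:e1]))
print(repr(file2[s2:e2]))
```

The same pair of logical positions selects the matching text in both files, whatever the spacing.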

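If you don't know the starting position of the search string in the first text, perform an ordinary search in the first text (with or without spaces, it's up to you), then traverse the second text to the same textual position.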

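Non-space positions for the first line of the text (a space repeats the index of the character before it):

The annotation framework we presented is
0          1          2           3
0122345678901223456789011233456789012234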
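A rough Python sketch of that search-then-traverse step is shown here. The function name `find_in_other` is made up for illustration, and the sketch assumes the needle occurs in the first text with exactly the spacing given and contains at least one non-whitespace character.

```python
def find_in_other(text1, text2, needle):
    """Find `needle` in text1 with an ordinary search, then return the
    corresponding (possibly re-spaced) substring of text2, or None."""
    raw = text1.find(needle)              # ordinary search in the first text
    if raw < 0:
        return None
    # Logical start: number of non-whitespace characters before the hit.
    start = sum(1 for ch in text1[:raw] if not ch.isspace())
    length = sum(1 for ch in needle if not ch.isspace())
    # Walk the second text to the same logical positions.
    count = 0
    raw_start = None
    for i, ch in enumerate(text2):
        if ch.isspace():
            continue
        if count == start:
            raw_start = i
        count += 1
        if count == start + length:
            return text2[raw_start:i + 1]
    return None

text1 = ("The annotation framework we presented is \n"
         "embedded in the Knowledge Management and \n")
text2 = ("Theannotationframework we presented is \n"
         "embedded in th e K n o w l e d ge Management and \n")

found = find_in_other(text1, text2, "the Knowledge Management")
print(repr(found))
```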

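If you need to locate many strings in the text, starting from the beginning each time becomes costly. In that case you can use an index table that relates space-free positions to ordinary positions, and perform a binary search when necessary.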
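One possible shape for such an index table, sketched in Python: a list holding the raw offset of every non-whitespace character. This layout is an assumption (the answer does not specify one), as are the helper names `build_index` and `to_logical`. Converting a space-free position to a raw offset is then a list lookup, and the reverse direction is a binary search with the standard `bisect` module. The sketch assumes the needle starts and ends with a non-whitespace character.

```python
from bisect import bisect_left

def build_index(text):
    """offsets[k] is the raw offset of the k-th non-whitespace character."""
    return [i for i, ch in enumerate(text) if not ch.isspace()]

def to_logical(offsets, raw):
    """Number of non-whitespace characters strictly before raw offset `raw`."""
    return bisect_left(offsets, raw)

text1 = "embedded in the Knowledge Management and"
text2 = "embedded in th e K n o w l e d ge Management and"
idx1, idx2 = build_index(text1), build_index(text2)   # built once per text

needle = "the Knowledge Management"
raw = text1.find(needle)
first = to_logical(idx1, raw)                     # logical index of first char
last = to_logical(idx1, raw + len(needle) - 1)    # logical index of last char
match = text2[idx2[first]:idx2[last] + 1]
print(repr(match))
```

Building each table costs one pass over the text; after that, every lookup is a binary search or a direct index instead of a fresh traversal.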

You can read the files into strings and remove all whitespace from both. After that it should be a straightforward string-matching exercise.

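If you also need the start index of the matched pattern, take the index of the match in the collapsed string and run a loop over the spaced version, counting only the non-space characters.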
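A minimal Python sketch of this collapse-and-search approach follows. The names `collapse` and `find_spaced` are hypothetical, and whitespace again means whatever `str.isspace()` accepts. One caveat of matching on collapsed strings is that a hit may straddle word boundaries that existed in the original text.

```python
def collapse(s):
    """Return `s` with every whitespace character removed."""
    return "".join(ch for ch in s if not ch.isspace())

def find_spaced(haystack, needle):
    """Return (raw_start, raw_end) of `needle` in `haystack`, ignoring
    whitespace in both, or None if there is no match."""
    key = collapse(needle)
    if not key:
        return None
    start = collapse(haystack).find(key)   # plain search on collapsed text
    if start < 0:
        return None
    end = start + len(key)
    # Map collapsed positions back by counting non-whitespace characters.
    count = 0
    raw_start = None
    for i, ch in enumerate(haystack):
        if ch.isspace():
            continue
        if count == start:
            raw_start = i
        count += 1
        if count == end:
            return raw_start, i + 1
    return None

haystack = ("Theannotationframework we presented is \n"
            "embedded in th e K n o w l e d ge Management and \n")
span = find_spaced(haystack, "the Knowledge Management")
print(repr(haystack[span[0]:span[1]]))
```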
