How do I find all of the longest common substrings that exist across multiple documents?



I have many text documents that I want to compare to one another and remove all text that is exactly the same between them. The goal is to find the consistent boilerplate text so it can be removed for NLP.

I figured the best way to do this would be to find the longest common substrings that exist in all, or most, of the documents. However, doing this is incredibly slow.

Here is an example of what I am trying to accomplish:

Document A:

Title: To Kill a Mocking Bird
Author: Harper Lee
Published: July 11, 1960

Document B:

Title: 1984
Author: George Orwell
Published: June 1949

Document C:

Title: The Great Gatsby
Author: F. Scott Fitzgerald

The output would show something like this:

{
'Title': 3,
'Author': 3,
'Published': 2,
}

The results would then be used to strip out the commonalities between the files.
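A minimal sketch of what that stripping step could look like, assuming a `results` dict of common substrings like the one above and a `docs` list of document strings (both hypothetical names, not part of the code further down):

def strip_common(docs, results):
    # Remove every discovered common substring from each document.
    cleaned = []
    for doc in docs:
        for phrase in results:
            doc = doc.replace(phrase, "")
        cleaned.append(doc)
    return cleaned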

Here is some code I have tested in Python. It becomes incredibly slow with any significant number of permutations:

import itertools
from difflib import SequenceMatcher
import pandas as pd

file_perms = list(itertools.permutations(files, 2))
results = {}
for p in file_perms:
    doc_a = p[0]
    doc_b = p[1]
    while True:
        seq_match = SequenceMatcher(a=doc_a, b=doc_b)
        match = seq_match.find_longest_match(0, len(doc_a), 0, len(doc_b))
        if match.size >= 5:
            doc_a_start, doc_a_stop = match.a, match.a + match.size
            doc_b_start, doc_b_stop = match.b, match.b + match.size
            match_word = doc_a[doc_a_start:doc_a_stop]
            if match_word in results:
                results[match_word] += 1
            else:
                results[match_word] = 1
            # Cut the matched span out of both documents and search again
            doc_a = doc_a[:doc_a_start] + doc_a[doc_a_stop:]
            doc_b = doc_b[:doc_b_start] + doc_b[doc_b_stop:]
        else:
            break
df = pd.DataFrame(
    {
        'Value': list(results.keys()),
        'Count': list(results.values())
    }
)
print(df)
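Note: this assumes `files` is a list of document contents already read into memory. A hypothetical loading step (the "docs/*.txt" path is a placeholder) might look like:

import glob

files = []
for path in glob.glob("docs/*.txt"):
    with open(path) as f:
        files.append(f.read())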

Create a set from each document. Build a counter over every word, counting how many documents it appears in. Go through each document, and when you find a word that appears in 70%-90% of the documents, append it together with the word that follows it as a tuple to a new counter. And again...

from collections import Counter

one_word = Counter()
for doc in docs:
    word_list = doc.split(" ")  # fix: split each doc, not the docs list
    word_set = set(word_list)
    for word in word_set:
        one_word[word] += 1

two_word = Counter()
threshold = len(docs) * 0.7
for doc in docs:
    word_list = doc.split(" ")
    for i in range(len(word_list) - 1):
        if one_word[word_list[i]] > threshold:
            key = (word_list[i], word_list[i + 1])
            two_word[key] += 1  # count the two-word pair

You can play with the threshold and keep going for as long as the counter is not empty.
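A sketch of that generalization, assuming `docs` is a list of document strings; `common_ngrams` is a hypothetical helper, not something from the answer above:

from collections import Counter

def common_ngrams(docs, threshold):
    # Count single words first, one vote per document.
    docs_words = [doc.split() for doc in docs]
    counts = Counter()
    for words in docs_words:
        for w in set(words):
            counts[(w,)] += 1
    n = 1
    while True:
        survivors = {g for g, c in counts.items() if c > threshold}
        if not survivors:
            return
        for g in survivors:
            yield g, counts[g]
        # Extend every surviving n-gram with the word that follows it.
        counts = Counter()
        for words in docs_words:
            grams = set()
            for i in range(len(words) - n):
                if tuple(words[i:i + n]) in survivors:
                    grams.add(tuple(words[i:i + n + 1]))
            for g in grams:
                counts[g] += 1
        n += 1

Usage would be something like: for gram, count in common_ngrams(docs, len(docs) * 0.7): print(gram, count)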

The documents are song lyrics: "Believer", "Rivers of Babylon", "I Can Stay Awake", "Rattlin' Bog".

from collections import Counter
import os
import glob

TR = 1  # threshold
dir = r"D:\docs"
path = os.path.join(dir, "*.txt")
files = glob.glob(path)

one_word = {}
all_docs = {}
for file in files:
    one_word[file] = set()
    all_docs[file] = []
    with open(file) as doc:
        for row in doc:
            for word in row.split():
                one_word[file].add(word)
                all_docs[file].append(word)
# now one_word is a dict where the key is the file name and the value is the set of words in it
# all_docs is a dict where the file name is the key and the value is the complete doc stored in a list word by word

common_Frase = Counter()
for key in one_word:
    for word in one_word[key]:
        common_Frase[word] += 1
# common_Frase contains a count of all word appearances across all files (every file can add a word once)

two_word = {}
for key in all_docs:
    two_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 1):
        if common_Frase[doc[index]] > TR:
            val = (doc[index], doc[index + 1])
            two_word[key].add(val)
for key in two_word:
    for word in two_word[key]:
        common_Frase[word] += 1
# now common_Frase also contains a count of all two-word phrases

three_word = {}
for key in all_docs:
    three_word[key] = set()
    doc = all_docs[key]
    for index in range(len(doc) - 2):
        val2 = (doc[index], doc[index + 1])
        if common_Frase[val2] > TR:
            val3 = (doc[index], doc[index + 1], doc[index + 2])
            three_word[key].add(val3)

for key in three_word:
    for word in three_word[key]:
        common_Frase[word] += 1
for k in common_Frase:
    if common_Frase[k] > 1:
        print(k)

Here is the output:

When Like all Don't and a my hear and feel Then your I'm in me you've gone What I'll never be from there's through now The words are

('all', 'the') ('and', 'the') ('the', 'words') ('by', 'the') ('and', 'the') ('in', 'the')
