Can Python reconstruct a jumbled sentence to match a complete sentence?



I have one CSV of sentences and another CSV in which the same sentences have been broken into pieces and jumbled.

For example, one CSV has:

The quick brown fox jumps over the lazy dog.

The other CSV has:

jumps over the
The quick brown fox
lazy dog.

Each CSV contains more than one sentence, but hopefully the example above gets the idea across.

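I have used fuzzy matching to check whether the fragments match, but now I want to reconstruct the sentences.
Can Python rebuild the jumbled CSV so that it matches the complete sentences?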

This is quite a challenging problem!

I tried an approach and explained it in the code comments below:

# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # faster when many fuzzy comparisons are needed

# Original sentences
clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A wizard's job is to vex chumps quickly in fog.",
]
# The jumbled CSV, in the form of a list
jumbled_sentences = [
    "is to vex chumps ",
    "jumps over the ",
    "The quick brown fox ",
    "quickly in fog.",
    "lazy dog.",
    "A wizard's job ",
]

for clean_sentence in clean_sentences:
    # Keep only the fragments that are 100% present in the original sentence
    # (that is why partial_ratio is used). limit=None returns every match
    # instead of only the default top 5.
    fuzz_results = process.extract(
        clean_sentence, jumbled_sentences,
        scorer=fuzz.partial_ratio, score_cutoff=100, limit=None,
    )
    sentences_found = [fuzz_result[0] for fuzz_result in fuzz_results]  # keep only the fragment text

    # Find where each fragment starts in the clean sentence and store it as {index: fragment}
    index_sent_dict = {}
    for sentence_found in sentences_found:
        index_sent_dict[clean_sentence.index(sentence_found)] = sentence_found

    # Sort the dictionary by index and join its values in that order
    sorted_dict = dict(sorted(index_sent_dict.items()))
    final_sentence = "".join(sorted_dict.values())
    print(final_sentence)

# The quick brown fox jumps over the lazy dog.
# A wizard's job is to vex chumps quickly in fog.
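
If the fragments are exact, verbatim pieces of the clean sentences (including spacing), as they are in the example, the fuzzy scorer may not be needed at all. Below is a minimal sketch under that assumption, using only the standard library. The helper name `rebuild` is hypothetical and not part of the approach above. It looks up each fragment with `str.find`, orders the hits by position, and joins them:

```python
def rebuild(clean_sentence, fragments):
    """Join the fragments that occur verbatim in clean_sentence, in sentence order."""
    positioned = []
    for fragment in fragments:
        position = clean_sentence.find(fragment)  # -1 when the fragment is absent
        if position != -1:
            positioned.append((position, fragment))
    positioned.sort()  # order by start position in the clean sentence
    return "".join(fragment for _, fragment in positioned)


clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A wizard's job is to vex chumps quickly in fog.",
]
jumbled_sentences = [
    "is to vex chumps ",
    "jumps over the ",
    "The quick brown fox ",
    "quickly in fog.",
    "lazy dog.",
    "A wizard's job ",
]

for clean_sentence in clean_sentences:
    rebuilt = rebuild(clean_sentence, jumbled_sentences)
    # Comparing against the clean sentence flags incomplete reconstructions
    print(rebuilt == clean_sentence, rebuilt)
```

Comparing the result with the clean sentence is a cheap sanity check. A short fragment that happens to appear in more than one sentence would be picked up for each of them, so this sketch does not handle that case.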
