I have one CSV of sentences and another CSV in which the same sentences are broken up and jumbled.
For example, one CSV has:
The quick brown fox jumps over the lazy dog.
The other CSV has:
jumps over the
The quick brown fox
lazy dog.
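Each CSV contains more than one sentence, but hopefully you get the idea from the example above.
I have used fuzzy matching to check whether they match, but now I want to reconstruct the sentences.
Can Python rebuild the jumbled CSV so that it matches the complete sentences?
This is a really challenging problem!
I tried a few things and explained them in the code comments below: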
# Original sentences
clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A wizard's job is to vex chumps quickly in fog.",
]

# CSV in the form of a list
jumbled_sentences = [
    "is to vex chumps ",
    "jumps over the ",
    "The quick brown fox ",
    "quickly in fog.",
    "lazy dog.",
    "A wizard's job ",
]

# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process  # use this for faster results when many fuzzywuzzy operations are needed
for clean_sentence in clean_sentences:
    # Keep only the jumbled fragments that are 100% present in the original sentence
    # (that is why partial_ratio is used). limit=None returns every match instead of the default top 5.
    fuzz_results = process.extract(clean_sentence, jumbled_sentences, scorer=fuzz.partial_ratio, score_cutoff=100, limit=None)
    sentences_found = [fuzz_result[0] for fuzz_result in fuzz_results]  # keep only the fragment text from each result

    # Find where each fragment starts in the original sentence and store it as a dictionary of {index: fragment}
    index_sent_dict = {}
    for sentence_found in sentences_found:
        index_sent_dict[clean_sentence.index(sentence_found)] = sentence_found

    # Sort the dictionary by index and join its values in that order
    sorted_dict = dict(sorted(index_sent_dict.items()))
    final_sentence = "".join(sorted_dict.values())
    print(final_sentence)

# The quick brown fox jumps over the lazy dog.
# A wizard's job is to vex chumps quickly in fog.
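
Since the fragments in this example are exact substrings of the original sentences, the same ordering idea can also be sketched with the standard library alone, using str.find instead of a fuzzy scorer. This is only a rough sketch under that assumption: the helper name rebuild_sentence is made up for illustration, and a fragment that occurs in more than one sentence, or twice within one sentence, would need extra bookkeeping that is not shown here.

```python
def rebuild_sentence(clean_sentence, fragments):
    """Order the fragments that occur verbatim in clean_sentence by their position.

    Hypothetical helper for illustration; assumes every fragment is an exact
    substring of at most one clean sentence.
    """
    positioned = []
    for fragment in fragments:
        position = clean_sentence.find(fragment)  # -1 when the fragment is absent
        if fragment and position != -1:
            positioned.append((position, fragment))
    positioned.sort()  # sort by start position in the clean sentence
    return "".join(fragment for _, fragment in positioned)


clean_sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A wizard's job is to vex chumps quickly in fog.",
]
jumbled_sentences = [
    "is to vex chumps ",
    "jumps over the ",
    "The quick brown fox ",
    "quickly in fog.",
    "lazy dog.",
    "A wizard's job ",
]

rebuilt = [rebuild_sentence(sentence, jumbled_sentences) for sentence in clean_sentences]
for clean, candidate in zip(clean_sentences, rebuilt):
    # Flag any sentence whose fragments did not reassemble exactly
    status = "exact" if candidate == clean else "mismatch"
    print(f"{status}: {candidate}")
```

Comparing each rebuilt string with its clean sentence gives a cheap sanity check. If the fragments contain typos or altered whitespace, the exact lookup will miss them, and the fuzzy approach above is the better fit.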