For example, I have a set of sentences like this:
New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country.
Lets take a bus to get to point b from point a.
And I have a sentence fragment like this:
is cool in the south of that country
The output should be: The weather is cool in the south of that country.
If I have an input like: of United States The weather is cool
then the output should be:
D.C. is the capital of United States The weather is cool in the south of that country.
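So far I have tried difflib and obtained the overlap, but that does not solve the problem in every case.
You can build two dictionaries from the sentences: one keyed by the ending expressions and one keyed by the starting expressions. Then look up the input in those dictionaries to find a prefix and a suffix to extend it with. In both cases you need to build (or check) a key for every run of words that starts at the beginning of a sentence and every run that reaches its end: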
sentences = """New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country
Lets take a bus to get to point b from point a""".split("\n")

# ends: every word suffix of a sentence -> the words that come before it
ends = { tuple(sWords[i:]):sWords[:i] for s in sentences
                                      for sWords in [s.split()]
                                      for i in range(len(sWords)) }
# starts: every word prefix of a sentence -> the words that follow it
starts = { tuple(sWords[:i]):sWords[i:] for s in sentences
                                        for sWords in [s.split()]
                                        for i in range(1,len(sWords)+1) }

def extendSentence(sentence):
    sWords = sentence.split(" ")
    # shortest leading run of input words that ends a known sentence
    prefix = next( (ends[p] for i in range(1,len(sWords)+1)
                            for p in [tuple(sWords[:i])] if p in ends),
                   [])
    # longest trailing run of input words that starts a known sentence
    suffix = next( (starts[p] for i in range(len(sWords))
                              for p in [tuple(sWords[i:])] if p in starts),
                   [])
    return " ".join(prefix + [sentence] + suffix)
Output:
print(extendSentence("of United States The weather is cool"))
# D.C. is the capital of United States The weather is cool in the south of that country
print(extendSentence("is cool in the south of that country"))
# The weather is cool in the south of that country
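Note that I had to remove the periods at the end of the sentences because they prevented the matches. You will need to clean these up in the dictionary building step.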
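One possible way to do that cleanup, as a minimal sketch rather than a definitive implementation: normalize the words only when building and looking up the dictionary keys, and keep the original words as the stored values so the punctuation survives in the output. The names `key_of`, `build_tables`, and `extend_sentence` are hypothetical helpers introduced here for illustration, and stripping only trailing `.,!?` is an assumption about what counts as noise in the input.

```python
SENTENCES = [
    "New York is in New York State.",
    "D.C. is the capital of United States.",
    "The weather is cool in the south of that country.",
    "Lets take a bus to get to point b from point a.",
]

def key_of(words):
    # Assumption: only trailing sentence punctuation gets in the way of matching.
    return tuple(w.rstrip(".,!?") for w in words)

def build_tables(sentences):
    ends, starts = {}, {}
    for s in sentences:
        words = s.split()
        # Keys are normalized; values keep the original words, periods included.
        for i in range(len(words)):
            ends[key_of(words[i:])] = words[:i]
        for i in range(1, len(words) + 1):
            starts[key_of(words[:i])] = words[i:]
    return ends, starts

def extend_sentence(sentence, ends, starts):
    words = sentence.split()
    prefix, suffix = [], []
    # Shortest leading run of the input that ends a known sentence.
    for i in range(1, len(words) + 1):
        key = key_of(words[:i])
        if key in ends:
            prefix = ends[key]
            break
    # Longest trailing run of the input that starts a known sentence.
    for i in range(len(words)):
        key = key_of(words[i:])
        if key in starts:
            suffix = starts[key]
            break
    return " ".join(prefix + words + suffix)

ENDS, STARTS = build_tables(SENTENCES)
print(extend_sentence("of United States The weather is cool", ENDS, STARTS))
print(extend_sentence("is cool in the south of that country.", ENDS, STARTS))
```

With this variant the sentence list can keep its periods, and an input fragment matches whether or not it ends with one. If two sentences share the same prefix or suffix, the later one silently overwrites the earlier entry, just as in the dictionary comprehensions above.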