Fixing misplaced separators in a string



Given the incorrect string:

s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

I want to output the corrected string, like this:

s="rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed"

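If I try to remove all the separators with

re.sub(r"\s*", "", s)

it gives me "rateimpliesdepreciation.Thestraightlinesshoweffectivelineartimetrendsinthenominal(dashed", which is not what I want.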
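To make the problem concrete, here is a small sketch using only the standard `re` module. It contrasts deleting every whitespace run with collapsing each run to a single space. The variable names `no_spaces` and `collapsed` are just illustrative. Collapsing keeps the word boundaries, but it cannot tell a real space from a spurious one, so "Th e" and "eff ective" stay broken.

```python
import re

s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

# Deleting every whitespace run glues all the words together.
no_spaces = re.sub(r"\s+", "", s)

# Collapsing each run to one space (and trimming the ends) keeps word
# boundaries, but the spurious breaks inside words are still there.
collapsed = re.sub(r"\s+", " ", s).strip()

print(no_spaces)
print(collapsed)
```

So a regex alone is not enough here: deciding which spaces are wrong needs some knowledge of which strings are real words.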

You could try checking the spelling of each word, for example with pyspellchecker

(pip install pyspellchecker)

from spellchecker import SpellChecker
spell = SpellChecker()
s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
splitted_s = s.split(' ')
splitted_s = list(filter(None, splitted_s))  # drop the empty strings produced by consecutive spaces

Then check whether a word is not in the dictionary while previous_word+word is:

valid_s = [splitted_s[0]]
for i in range(1, len(splitted_s)):
    word = splitted_s[i]
    previous_word = splitted_s[i-1]
    valid_s.append(word)
    # word is unknown to the dictionary...
    if spell.unknown([word]) and len(word) > 0:
        # ...but joined with the previous word it is known
        if not spell.unknown([(previous_word+word).lower()]):
            valid_s.pop()  # remove word
            valid_s.pop()  # remove previous_word
            valid_s.append(previous_word+word)
print(' '.join(valid_s))
>>> rate implies depreciation. Th e straight lines show effective linear time trends in the nominal (dashed

But here, because "e" is itself a word in the dictionary, "Th" and "e" are not joined.

So you can also compare word frequencies, and join previous_word+word whenever it is more frequent in the dictionary than word on its own:

valid_s = [splitted_s[0]]
for i in range(1, len(splitted_s)):
    word = splitted_s[i]
    previous_word = splitted_s[i-1]
    valid_s.append(splitted_s[i])
    # join when the combined word is more frequent than word alone
    if spell.word_probability(word.lower()) < spell.word_probability((previous_word+word).lower()):
        valid_s.pop()  # remove word
        valid_s.pop()  # remove previous_word
        valid_s.append(previous_word+word)

print(' '.join(valid_s))
>>> rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed
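The frequency comparison can also be wrapped in a small reusable function. The following is only a sketch under a few assumptions. The helper `merge_split_words`, the lookup `toy_probability` and the table `TOY_FREQ` are hypothetical names, and the frequencies are made up so that the example runs without any third-party package. Unlike the loop above, this variant compares each token against the last token already kept, so a word split into three pieces could be rejoined step by step.

```python
from typing import Callable, Dict, List


def merge_split_words(text: str, probability: Callable[[str], float]) -> str:
    """Join a token to the previous one when the joined form is more frequent."""
    merged: List[str] = []
    # split() with no argument splits on any whitespace and drops empty strings
    for token in text.split():
        if merged:
            candidate = merged[-1] + token
            if probability(token.lower()) < probability(candidate.lower()):
                merged[-1] = candidate  # replace the previous token with the joined word
                continue
        merged.append(token)
    return " ".join(merged)


# Hypothetical, made-up frequencies for illustration only.
TOY_FREQ: Dict[str, float] = {"the": 0.05, "e": 0.0001, "effective": 0.001}


def toy_probability(word: str) -> float:
    # Words missing from the toy table get frequency 0.
    return TOY_FREQ.get(word, 0.0)


s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
fixed = merge_split_words(s, toy_probability)
print(fixed)
```

With pyspellchecker, the frequency lookup used in the answer (spell.word_probability) could be passed in place of toy_probability. Passing the lookup as a parameter keeps the merging logic independent of any particular dictionary.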
