Fixing misplaced spaces in a string

Given this incorrect string:

s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

I want to output the corrected string, like this:

s="rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed"

If I try to remove all whitespace with

re.sub("\s*","",s)

it gives me "rateimpliesdepreciation.Thestraightlinesshoweffectivelineartimetrendsinthenominal(dashed", which is not what I want.
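
For comparison, here is a minimal sketch (standard library only) of collapsing runs of whitespace into single spaces instead of deleting them. It tidies the spacing but leaves the broken words "Th e" and "eff ective" split, which is why some kind of dictionary lookup is needed. The variable names no_spaces and normalized are illustrative only.

```python
import re

s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "

# Deleting every whitespace character glues all the words together.
no_spaces = re.sub(r"\s+", "", s)

# Collapsing whitespace runs keeps word boundaries,
# but it cannot tell a real boundary from a stray space inside a word.
normalized = re.sub(r"\s+", " ", s).strip()

print(no_spaces)
print(normalized)
```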

You could try checking the spelling of each word, for example with pyspellchecker

(pip install pyspellchecker)

from spellchecker import SpellChecker
spell = SpellChecker()
s="rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
splitted_s = s.split(' ')
splitted_s = list(filter(None, splitted_s))  # drop the empty strings produced by consecutive spaces

Then check whether a word is unknown while previous_word+word is known:

valid_s = [splitted_s[0]]
for i in range(1, len(splitted_s)):
    word = splitted_s[i]
    previous_word = splitted_s[i-1]
    valid_s.append(word)
    # word is not in the dictionary ...
    if spell.unknown([word]) and len(word) > 0:
        # ... but the concatenation with the previous word is
        if not spell.unknown([(previous_word + word).lower()]):
            valid_s.pop()  # remove word
            valid_s.pop()  # remove previous_word
            valid_s.append(previous_word + word)
print(' '.join(valid_s))
>>>rate implies depreciation. Th e straight lines show effective linear time trends in the nominal (dashed

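But here, because "e" is itself a word in the dictionary, "Th" and "e" are not joined.

So you can also compare word frequencies, and join previous_word+word whenever the joined form is more frequent in the dictionary than word on its own: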

valid_s = [splitted_s[0]]
for i in range(1, len(splitted_s)):
    word = splitted_s[i]
    previous_word = splitted_s[i-1]
    valid_s.append(splitted_s[i])
    # join when the concatenation is more frequent than the word alone
    if spell.word_probability(word.lower()) < spell.word_probability((previous_word + word).lower()):
        valid_s.pop()  # remove word
        valid_s.pop()  # remove previous_word
        valid_s.append(previous_word + word)

print(' '.join(valid_s))
>>>rate implies depreciation. The straight lines show effective linear time trends in the nominal (dashed
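
The frequency comparison can also be wrapped in a reusable function. The sketch below is only an illustration under stated assumptions: TOY_FREQ is a made-up frequency table standing in for a real dictionary (with pyspellchecker, spell.word_probability would play that role), and word_freq and merge_split_words are hypothetical helper names. Unlike the loop above, it compares against the last token already kept, so a fragment can be joined onto a word that was itself just merged.

```python
import re

# Hypothetical toy frequencies, invented for this example only.
TOY_FREQ = {"the": 0.05, "e": 0.0001, "effective": 0.0002}

def word_freq(token):
    """Look up a token's frequency, ignoring case and non-letter characters."""
    key = re.sub(r"[^a-z]", "", token.lower())
    return TOY_FREQ.get(key, 0.0)

def merge_split_words(text, freq=word_freq):
    """Join a token onto the previous one when the joined form is more frequent."""
    merged = []
    for token in text.split():  # split() with no argument also drops empty strings
        if merged and freq(merged[-1] + token) > freq(token):
            merged[-1] += token
        else:
            merged.append(token)
    return " ".join(merged)

s = "rate implies depreciation. Th  e straight lines show eff ective linear time trends in the nominal (dashed "
print(merge_split_words(s))
```

Keep in mind that this is a heuristic: two legitimate neighbouring words will also be merged whenever their concatenation happens to be more frequent than the second word, so the result is worth checking on real text.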
