Using NLTK or a similar tool to determine sentence boundaries

I know how to use NLTK's PunktSentenceTokenizer to split text into sentences.

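However, I have a different requirement: my text was converted from a PDF, and page breaks can split a sentence in two. Is there a way to use NLTK to tell whether the end of a string is a sentence boundary? If it is not, I can join that string with the next one.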

For example, here are my strings:

"I have converted the text"
"Is there any way to save humanity?"

The first string does not end at a sentence boundary; the second one does.

If you are working with English, NLTK already ships a pretrained Punkt model you can load: english.pickle.

import nltk.data
text = '''
(How does it deal with this parenthesis?)  "It should be part of the
previous sentence." "(And the same with this one.)" ('And this one!')
"('(And (this)) '?)" [(and this. )]
'''
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# Print each detected sentence, separated by a divider line
print('\n-----\n'.join(sent_detector.tokenize(text.strip())))

Output:

(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]

Read more in the nltk.tokenize documentation.
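The tokenizer above splits text into sentences, but it does not directly say whether a given chunk ends mid-sentence. As a rough sketch of the joining step the question describes, the following uses a plain regular-expression heuristic rather than Punkt: a chunk counts as complete if it ends in `.`, `!` or `?`, optionally followed by closing quotes or brackets. The names `ends_at_sentence_boundary` and `merge_broken_chunks` and the sample strings are hypothetical, and the heuristic will misjudge chunks that end in an abbreviation such as "Dr.".

```python
import re

# Closing quotes/brackets that may legitimately follow terminal punctuation.
_CLOSERS = "\"')]}»”’"
# Assumption: a sentence ends with ., ! or ?, then optional closers and whitespace.
_SENTENCE_END = re.compile(r"[.!?][" + re.escape(_CLOSERS) + r"]*\s*$")


def ends_at_sentence_boundary(chunk: str) -> bool:
    """Heuristic check: does the chunk end with sentence-final punctuation?"""
    return bool(_SENTENCE_END.search(chunk))


def merge_broken_chunks(chunks):
    """Join every chunk that ends mid-sentence with the chunk that follows it."""
    merged = []
    pending = ""
    for chunk in chunks:
        # Append the new chunk to whatever fragment is still open.
        pending = f"{pending} {chunk.strip()}".strip()
        if ends_at_sentence_boundary(pending):
            merged.append(pending)
            pending = ""
    if pending:
        # Keep a trailing fragment that never reached a terminator.
        merged.append(pending)
    return merged


if __name__ == "__main__":
    # Hypothetical page fragments from a PDF conversion.
    pages = ["The report was split across", "two pages.", "Is this one complete?"]
    for sentence_block in merge_broken_chunks(pages):
        print(sentence_block)
```

Once the fragments are merged, each block can be passed to `sent_detector.tokenize` as usual, so the heuristic only has to decide where page breaks cut a sentence, not where every sentence ends.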
