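Unable to use NLTK on text extracted from The Silmarillion

I am trying to use Tolkien's Silmarillion as a practice text to learn some NLP with nltk.

I am having trouble getting started because I am running into text encoding problems.

I am using the TextBlob wrapper around NLTK because it is a lot easier to use. TextBlob is available at: https://github.com/sloria/TextBlob

The sentence I cannot parse is: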

"But Húrin did not answer, and they sat beside the stone, and did not speak again".

I believe the special character in Húrin is causing the problem.

My code:

from text.blob import TextBlob
b = TextBlob( 'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
b.noun_phrases
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

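Since this is just a fun project, I only want to be able to use the text, extract some attributes, and do some basic processing.

How can I convert this text to ASCII when I don't know what the initial encoding is? I tried decoding from UTF-8 and then re-encoding to ASCII: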

>>> asc = unicode_text.decode('utf-8')
>>> asc = unicode_text.encode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128)

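But even this does not work. Any suggestions are appreciated. I don't mind losing the special characters, as long as it is done consistently across the document.

I am using Python 2.6.8, and the required modules are correctly installed.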

First, update TextBlob to the latest version (0.6.0 as of this writing), since recent updates include some unicode fixes. This can be done with

$ pip install -U textblob

Then, use a unicode literal, like so:

from text.blob import TextBlob
b = TextBlob( u'But Húrin did not answer, and they sat beside the stone, and did not speak again' )
noun_phrases = b.noun_phrases
print noun_phrases
# WordList([u'h\xfarin'])
print noun_phrases[0]
# húrin

This was verified on Python 2.7.5 with TextBlob 0.6.0, but it should work with Python 2.6.8 as well.
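
If dropping the accents really is acceptable, one possible approach (a minimal sketch, not something TextBlob requires) is to decode the raw bytes first and then strip the diacritics with the standard-library unicodedata module. This assumes the source text is UTF-8, which the 0xc3 byte in the error message suggests but does not prove; the helper name to_ascii is made up for illustration. Note that in Python 2, calling .encode('ascii') on a byte string first decodes it implicitly as ASCII, which is why the attempt in the question raises UnicodeDecodeError rather than an encode error.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode bytes, then drop accents to get ASCII text.

    The 'utf-8' default is an assumption about the source file.
    """
    # Step 1: bytes -> unicode text, using the assumed encoding.
    text = raw_bytes.decode(encoding)
    # Step 2: NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Step 3: encode to ASCII, silently dropping anything that does not fit
    # (including the combining marks), then return text rather than bytes.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


raw = b'But H\xc3\xbarin did not answer'  # UTF-8 bytes for "But Húrin ..."
print(to_ascii(raw))
# But Hurin did not answer
```

Characters with no ASCII decomposition are silently dropped by the 'ignore' handler, so the result is consistent across the document but lossy. If the accents matter, keeping the text as unicode, as in the answer above, is the safer choice.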
