NLTK - Decoding Unicode in custom corpus

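I created a custom corpus using nltk's CategorizedPlaintextCorpusReader.

The .txt files in my corpus contain Unicode characters that I am unable to decode. I understand this is a "plaintext" reader, but the text still needs to be decoded.

Code: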

import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
import os

# Raw strings keep the backslashes in the Windows path and the regex intact.
mr = CategorizedPlaintextCorpusReader(r'C:\mycorpus', r'(?!\.).*\.txt',
                                      cat_pattern=os.path.join(r'(neg|pos)', '.*'))
for w in mr.words():
    print(w)

This prints the tokenized words of the files that do not contain Unicode characters, and then raises the following error:

    for w in mr.words():
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 402, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Python\Python36-32\lib\site-packages\nltk\corpus\reader\plaintext.py", line 122, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1168, in readline
    new_chars = self._read(readsize)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1400, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "C:\Python\Python36-32\lib\site-packages\nltk\data.py", line 1431, in _incr_decode
    return self.decode(bytes, 'strict')
  File "C:\Python\Python36-32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 30: invalid start byte

I tried to decode it with:

mr.decode('unicode-escape')

which raises this error:

AttributeError: 'CategorizedPlaintextCorpusReader' object has no attribute 'decode'

I am using Python 3.6.4.

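The problem is that NLTK's corpus readers assume your plain-text files are saved in UTF-8. That assumption is evidently wrong here, because the files were encoded with a different codec. My guess is CP1252 (also known as "Windows Latin-1"), since it is very common and fits your description well: in that encoding, the en dash "–" is encoded as the byte 0x96, which is the byte named in the error message.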
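To make the diagnosis concrete, here is a minimal sketch that reproduces the failure outside NLTK. The byte string `raw` is an invented sample, not taken from the asker's corpus. It only illustrates that a lone 0x96 byte is invalid in UTF-8 but decodes to an en dash under CP1252.

```python
# Hypothetical sample: ASCII text with a single CP1252-encoded en dash (0x96).
raw = b"positive \x96 mostly"

try:
    raw.decode("utf-8")
    utf8_ok = True
    utf8_error = ""
except UnicodeDecodeError as exc:
    # 0x96 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False
    utf8_error = str(exc)

# The same bytes decode cleanly as CP1252, yielding U+2013 (en dash).
decoded = raw.decode("cp1252")

print(utf8_ok)
print(utf8_error)
print(repr(decoded))
```

If the files in question contain other bytes in the 0x80 to 0x9F range, the same kind of check can help confirm or rule out CP1252.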

You can specify the encoding of the input files in the corpus reader's constructor:

mr = CategorizedPlaintextCorpusReader(
    r'C:\mycorpus',
    r'(?!\.).*\.txt',
    cat_pattern=os.path.join(r'(neg|pos)', '.*'),
    encoding='cp1252')

Try this, and check whether the non-ASCII characters in the output (dashes, bullet points) are still correct and have not been replaced with mojibake.
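If it is unclear whether every file in the corpus uses the same encoding, a quick survey of the raw bytes may help before settling on a value for `encoding`. The sketch below is only an illustration: `first_working_encoding` and `report_encodings` are hypothetical helper names, and the candidate list is an assumption. A successful decode does not prove the guess is right. It only shows that the bytes are valid in that codec, so the output should still be checked by eye.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def first_working_encoding(
    data: bytes, candidates: Iterable[str] = ("utf-8", "cp1252")
) -> Optional[str]:
    """Return the first candidate codec that decodes `data` without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


def report_encodings(root: str, pattern: str = "**/*.txt") -> Dict[str, Optional[str]]:
    """Map each matching file under `root` to its first working candidate codec."""
    return {
        str(path): first_working_encoding(path.read_bytes())
        for path in Path(root).glob(pattern)
    }
```

UTF-8 is tried first on purpose: valid UTF-8 text would almost always also decode as CP1252, only into mojibake, so the stricter codec should get the first chance. Calling `report_encodings` on the corpus directory then shows which files fall outside UTF-8.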
