Gensim sentences from ontology corpus: Unicode error

I'm on Windows 10, using Python 2.7.15 | Anaconda. Whenever I run

mymodel = gensim.models.Word2Vec.load(pretrain)
mymodel.min_count = mincount
sentences = gensim.models.word2vec.LineSentence('ontology_corpus.lst')
mymodel.build_vocab(sentences, update=True)  # ERROR HERE ****

I get this error:

Traceback (most recent call last):
  File "runWord2Vec.py", line 23, in <module>
    mymodel.build_vocab(sentences, update=True)
  File "C:\xxxx\lib\site-packages\gensim\models\base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "C:\xxxx\lib\site-packages\gensim\models\word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "C:\xxxxx\lib\site-packages\gensim\models\word2vec.py", line 1560, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "C:\xxxx\lib\site-packages\gensim\models\word2vec.py", line 1442, in __iter__
    line = utils.to_unicode(line).split()
  File "C:\xxxx\lib\site-packages\gensim\utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "C:\xxxxx\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 124: invalid continuation byte

Now, this traces back to the LineSentence class:

class LineSentence(object):
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()  # ERROR HERE *************
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length

In the last call shown in the traceback, I could change the errors argument to errors='ignore', or change this line:

    utils.to_unicode(line).split()

to:

    line.split()
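
For reference, the same errors='ignore' tweak can live outside the gensim source in a small wrapper class; this is only a sketch (TolerantLineSentence is my own name, the body just mirrors LineSentence above):

import itertools
from gensim import utils

class TolerantLineSentence(object):
    """Like LineSentence, but decodes with errors='ignore' instead of 'strict'."""
    def __init__(self, source, max_sentence_length=10000, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        with utils.smart_open(self.source) as fin:
            for line in itertools.islice(fin, self.limit):
                # undecodable bytes are dropped rather than raising UnicodeDecodeError
                line = utils.to_unicode(line, errors='ignore').split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length

sentences = TolerantLineSentence('ontology_corpus.lst')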

A sample from the ontology_corpus.lst file:

<http://purl.obolibrary.org/obo/GO_0090141> EquivalentTo <http://purl.obolibrary.org/obo/GO_0065007> and  <http://purl.obolibrary.org/obo/RO_0002213> some <http://purl.obolibrary.org/obo/GO_0000266> 
<http://purl.obolibrary.org/obo/GO_0090141> SubClassOf <http://purl.obolibrary.org/obo/GO_0065007>

The problem is that it works this way, but I'm afraid the results will be flawed because the encoding errors are being ignored! Is there a proper solution, or is my approach fine as it is?

This is most likely because some line (or lines) of your file contains data that isn't properly UTF-8 encoded.

If build_vocab() otherwise succeeds, and the corruption is inadvertent, rare, or doesn't affect the word vectors you're specifically interested in, it may not have much effect on your final results. (And from your sample lines, the file looks as if it shouldn't contain any UTF-8 corruption or characters with potential encoding problems.)

However, if it is a concern, you could try to identify the exact problem line(s) by iterating over sentences yourself, triggering the error outside of build_vocab(). For example:

for i, sentence in enumerate(sentences):
    print(i)
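
A slightly more defensive version of that probe (just a sketch that reads the raw file directly, outside gensim entirely) reports every undecodable line instead of stopping at the first:

# scan the raw bytes and report each line that fails UTF-8 decoding
with open('ontology_corpus.lst', 'rb') as fin:
    for i, raw in enumerate(fin):
        try:
            raw.decode('utf8')
        except UnicodeDecodeError as e:
            print('line %d: %s' % (i, e))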

Where it stops (if the error ends the iteration), or where error messages are interleaved with the printed counts, will give you a hint about the problem line(s). You can then inspect those lines in a text editor to see which characters are involved. From there, you could consider removing/changing those characters, or try to discover the file's true encoding and re-encode it to UTF-8, using your knowledge of the ranges/characters involved.
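
Once you think you've identified the true encoding, the re-encode could look roughly like this sketch (the source encoding 'cp1252' is only a placeholder guess; substitute whatever you discover):

# rewrite the corpus as UTF-8, decoding from the discovered encoding
with open('ontology_corpus.lst', 'rb') as fin, \
        open('ontology_corpus_utf8.lst', 'wb') as fout:
    for raw in fin:
        fout.write(raw.decode('cp1252').encode('utf8'))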

(A separate note about your apparent corpus: word2vec training tends to work best when the many alternate usage examples of a single token are spread throughout the corpus, interleaved with contrasting examples of other tokens, rather than clumped together. So if your corpus is a dump from some source where all the lines relating to one token, such as <http://purl.obolibrary.org/obo/GO_0090141>, appear together, you may get somewhat improved final vectors by shuffling the lines before training.)
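
If you go that route, a one-off shuffle can be as simple as this sketch (assuming the file fits comfortably in memory):

import random

# shuffle the corpus lines once, writing a new file to train from
with open('ontology_corpus.lst', 'rb') as fin:
    lines = fin.readlines()
random.shuffle(lines)
with open('ontology_corpus_shuffled.lst', 'wb') as fout:
    fout.writelines(lines)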
