I'm on Windows 10, using Python 2.7.15 | Anaconda. Whenever I run
mymodel = gensim.models.Word2Vec.load(pretrain)
mymodel.min_count = mincount
sentences = gensim.models.word2vec.LineSentence('ontology_corpus.lst')
mymodel.build_vocab(sentences, update=True)  # ERROR HERE ****
I get this error:
Traceback (most recent call last):
  File "runWord2Vec.py", line 23, in <module>
    mymodel.build_vocab(sentences, update=True)
  File "C:\xxxx\lib\site-packages\gensim\models\base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "C:\xxxx\lib\site-packages\gensim\models\word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "C:\xxxxx\lib\site-packages\gensim\models\word2vec.py", line 1560, in _scan_vocab
    for sentence_no, sentence in enumerate(sentences):
  File "C:\xxxx\lib\site-packages\gensim\models\word2vec.py", line 1442, in __iter__
    line = utils.to_unicode(line).split()
  File "C:\xxxx\lib\site-packages\gensim\utils.py", line 359, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "C:\xxxxx\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe6 in position 124: invalid continuation byte
Now, this traces back to this LineSentence class:
class LineSentence(object):
    def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
        self.source = source
        self.max_sentence_length = max_sentence_length
        self.limit = limit

    def __iter__(self):
        """Iterate through the lines in the source."""
        try:
            # Assume it is a file-like object and try treating it as such
            # Things that don't have seek will trigger an exception
            self.source.seek(0)
            for line in itertools.islice(self.source, self.limit):
                line = utils.to_unicode(line).split()
                i = 0
                while i < len(line):
                    yield line[i: i + self.max_sentence_length]
                    i += self.max_sentence_length
        except AttributeError:
            # If it didn't work like a file, use it as a string filename
            with utils.smart_open(self.source) as fin:
                for line in itertools.islice(fin, self.limit):
                    line = utils.to_unicode(line).split()  # ERROR HERE *************
                    i = 0
                    while i < len(line):
                        yield line[i: i + self.max_sentence_length]
                        i += self.max_sentence_length
In the last frame visible in the traceback, I can change the errors parameter to errors='ignore', or change this line:
utils.to_unicode(line).split()
to:
line.split()
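As an alternative to patching gensim's installed source, the same "ignore undecodable bytes" idea can be sketched as a small replacement iterator (this class and its name are my own, not part of gensim's API):

```python
class TolerantLineSentence(object):
    """Like gensim's LineSentence, but decodes with errors='ignore'.

    A sketch, not gensim code: bytes that are not valid UTF-8 are
    silently dropped instead of raising UnicodeDecodeError.
    """
    def __init__(self, path, max_sentence_length=10000):
        self.path = path
        self.max_sentence_length = max_sentence_length

    def __iter__(self):
        with open(self.path, 'rb') as fin:
            for raw in fin:
                # Decode leniently, then split into tokens as LineSentence does
                tokens = raw.decode('utf-8', 'ignore').split()
                for i in range(0, len(tokens), self.max_sentence_length):
                    yield tokens[i:i + self.max_sentence_length]
```

An instance of this can then be passed to build_vocab() in place of the gensim LineSentence, leaving the installed library untouched.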
A sample of the ontology_corpus.lst file:
<http://purl.obolibrary.org/obo/GO_0090141> EquivalentTo <http://purl.obolibrary.org/obo/GO_0065007> and <http://purl.obolibrary.org/obo/RO_0002213> some <http://purl.obolibrary.org/obo/GO_0000266>
<http://purl.obolibrary.org/obo/GO_0090141> SubClassOf <http://purl.obolibrary.org/obo/GO_0065007>
The problem is that this works, but I'm afraid the results will be flawed because the encoding errors are simply ignored! Is there a proper solution, or is my approach fine as is?
This is likely because some line or lines of your file contain data that isn't properly UTF-8 encoded.
If build_vocab()
otherwise succeeds, and the corruption is unintentional, rare, or doesn't affect the word-vectors you particularly care about, then it may not have much effect on your final results.
But if it is a concern, you could try to determine the exact line(s) with the problem by iterating over sentences
yourself, to trigger the error outside of build_vocab()
. For example:
for i, sentence in enumerate(sentences):
print(i)
Where it stops (if the error terminates the iteration), or how the printed counts interleave with the error message, will give you a hint of which line(s) are the problem. You can inspect those in a text/hex editor to see which characters are involved. Then you can either remove/change those characters, or try to discover the file's true encoding &amp; re-encode it to UTF-8, using your knowledge of the ranges/characters involved.
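The same scan can be made more precise with a small helper that decodes each raw line strictly and records exactly where decoding fails (a sketch; find_bad_lines is a name I've made up):

```python
def find_bad_lines(path):
    """Return (line_number, byte_offset, offending_bytes) for every line
    that fails strict UTF-8 decoding."""
    bad = []
    with open(path, 'rb') as fin:
        for lineno, raw in enumerate(fin, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as err:
                # err.start is the byte offset within this line
                bad.append((lineno, err.start, raw[err.start:err.start + 4]))
    return bad

# Example usage:
# for lineno, pos, snippet in find_bad_lines('ontology_corpus.lst'):
#     print(lineno, pos, repr(snippet))
```

This reports the byte offset and a short slice of the offending bytes, which is exactly what you need when checking the file in a hex editor or guessing its true encoding.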
(A separate note on your apparent corpus: word2vec training works best when the many varied usage examples of a single token are spread throughout the corpus, interleaved with contrasting examples of other tokens. So if your corpus is a dump from some source that clumps related tokens, like <http://purl.obolibrary.org/obo/GO_0090141>
, together, you might get somewhat improved final vectors if you shuffle the lines before training.)
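That shuffling step can be sketched as a one-off preprocessing pass (the helper name and output filename here are my own choices, not anything gensim requires):

```python
import random

def shuffle_corpus(src, dst, seed=None):
    """Read all corpus lines, shuffle them once, and write a new file,
    so related tokens are interleaved rather than clumped together."""
    rng = random.Random(seed)
    with open(src, 'rb') as fin:
        lines = fin.readlines()
    rng.shuffle(lines)
    with open(dst, 'wb') as fout:
        fout.writelines(lines)

# Example usage:
# shuffle_corpus('ontology_corpus.lst', 'ontology_corpus_shuffled.lst')
```

Reading and writing in binary mode keeps any problem bytes untouched, so shuffling can be done before or after fixing the encoding.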