How does the output of glove2word2vec() differ from keyed_vectors.save()?



I'm new to NLP and have run into a problem I don't understand at all:

I have a text file with GloVe vectors. I converted it to word2vec format using

glove2word2vec(TXT_FILE_PATH, KV_FILE_PATH)

This created a KV file at my path, which can then be loaded using

word_vectors = KeyedVectors.load_word2vec_format(KV_FILE_PATH, binary=False)

I then saved it with

word_vectors.save(KV_FILE_PATH)

But when I now try to use the new KV file in intersect_word2vec_format, it gives me an encoding error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-11-d975bb14af37> in <module>
6 
7 print("Intersect with pre-trained model...")
----> 8 model.intersect_word2vec_format(KV_FILE_PATH, binary=False)
9 
10 print("Train custom word2vec model...")
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/gensim/models/word2vec.py in intersect_word2vec_format(self, fname, lockf, binary, encoding, unicode_errors)
890         logger.info("loading projection weights from %s", fname)
891         with utils.open(fname, 'rb') as fin:
--> 892             header = utils.to_unicode(fin.readline(), encoding=encoding)
893             vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
894             if not vector_size == self.wv.vector_size:
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/gensim/utils.py in any2unicode(text, encoding, errors)
366     if isinstance(text, unicode):
367         return text
--> 368     return unicode(text, encoding, errors=errors)
369 
370 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

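The .save() method saves the model in Gensim's native format, which is mostly a Python pickle, with large arrays stored as separate files (these must be kept alongside the main save file).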

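That format is different from the word2vec_format that load_word2vec_format() and intersect_word2vec_format() can load. The byte 0x80 at position 0 in your traceback is consistent with this: a pickle written with protocol 2 or later starts with that byte, which is not a valid UTF-8 start byte, so reading the header line fails.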
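A quick way to tell the two kinds of file apart is to look at the first line. The sketch below uses only the standard library; looks_like_word2vec_text is a hypothetical helper (not part of Gensim), and the pickled dict is only a stand-in for a file written by .save(), not Gensim's actual on-disk layout.

```python
import pickle
import tempfile
from pathlib import Path


def looks_like_word2vec_text(path):
    """Return True if the first line is a '<vocab_size> <vector_size>' header.

    Hypothetical helper for illustration; it is not a Gensim function.
    """
    with open(path, "rb") as fin:
        first_line = fin.readline()
    try:
        fields = first_line.decode("utf-8").split()
    except UnicodeDecodeError:
        # A pickle written with protocol 2 or later starts with byte 0x80,
        # which cannot begin a UTF-8 sequence.
        return False
    return len(fields) == 2 and all(f.isdigit() for f in fields)


with tempfile.TemporaryDirectory() as tmp:
    # A tiny file in word2vec text format: header line, then one word per line.
    text_path = Path(tmp) / "vectors.txt"
    text_path.write_text(
        "2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n", encoding="utf-8"
    )

    # Stand-in for a pickle-based save file (assumption: plain dict, not a real model).
    pickle_path = Path(tmp) / "vectors.kv"
    with open(pickle_path, "wb") as fout:
        pickle.dump({"hello": [0.1, 0.2, 0.3]}, fout, protocol=pickle.HIGHEST_PROTOCOL)

    text_ok = looks_like_word2vec_text(text_path)
    pickle_ok = looks_like_word2vec_text(pickle_path)

print(text_ok, pickle_ok)
```

If the check returns False for your KV_FILE_PATH, the file was most likely overwritten by .save() and is no longer in word2vec format.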

If you want to save a set of vectors in word2vec_format, use the .save_word2vec_format() method rather than plain .save().
