Accessing the Universal Sentence Encoder training vocabulary



My question builds on this similar question, but the multilingual universal embedding has a slightly different structure:

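from tensorflow.python.saved_model import loader_impl

# Parse the SavedModel protobuf from disk without loading it into a session
saved_model = loader_impl.parse_saved_model("/path_to/universal_sent_encoder")
graph = saved_model.meta_graphs[0].graph_def
# Node list of the first library function whose definition mentions "ptb"
fns = [f for f in graph.library.function if "ptb" in str(f).lower()][0].node_def
print(len(fns))
>>> 1272
# The SentencePiece op keeps its model in a bytes attribute named 'model'
nodes = [n for n in fns if 'SentencepieceOp' in n.name]
model_string = nodes[0].attr.get('model').s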

I get a byte string, which I assume is a compressed list/dict:

model_string[100:200]
>>> b"\x19\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04\n\n\n\x03\xe2\x96\x81\x15_\xbaU\xc0\n\x08\n\x01,\x15~\xdac\xc0\n\x08\n\x01.\x15\x08\xf6d\xc0\n\x08\n\x01s\x15\xe8\xa8\x8b\xc0\n\x0b\n\x04\xe2\x96\x81a\x15\xaf \x9b\xc0\n\x08\n\x01'\x15j\xe9\x9b\xc0\n\r\n\x06\xe2\x96\x81th"

But I have tried several ways to decompress it, without success:

import codecs
import pickle

decoded_model_string = codecs.decode(model_string, 'ISO-8859-1') # decodes just fine

pickle.loads(model_string)
>>>
UnpicklingError                           Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'.

pickle.loads(model_string.encode('utf-8'))
>>>
UnpicklingError                           Traceback (most recent call last)
<ipython-input-183-857101d05cb4> in <module>
----> 1 pickle.loads(model_string)
UnpicklingError: invalid load key, '\x0a'.

I also tried tensorflow.io.decode_raw, but ran into UTF decoding errors there as well.

It took a while, but it turns out I had to load the byte string as a serialized SentencePiece model:

import sentencepiece as spm
sp_model = spm.SentencePieceProcessor()
sp_model.LoadFromSerializedProto(model_string)
vocab = {sp_model.IdToPiece(i): i for i in range(sp_model.GetPieceSize())}
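
For anyone wondering why pickle rejects the string: it is not compressed, it is a serialized protobuf message, which is why LoadFromSerializedProto accepts it. The sketch below hand-decodes the 100-byte slice printed in the question using only the standard library. Assumptions on my part: the backslashes in the slice are restored as I read them, the tag byte of the first record sits just before the slice (so I prepend it), and the field numbers (1 = piece, 2 = score, 3 = type) follow sentencepiece_model.proto. The helper names are made up for illustration; this is not how sentencepiece itself parses the model.

```python
import struct

# model_string[100:200] from the question, with b"\n" prepended because the
# first record's tag byte (0x0a) falls just before the slice (assumption).
chunk = b"\n" + (
    b"\x19\n\x10extra_token_id_3\x15\x00\x00\x00\x00\x18\x04"
    b"\n\n\n\x03\xe2\x96\x81\x15_\xbaU\xc0"
    b"\n\x08\n\x01,\x15~\xdac\xc0"
    b"\n\x08\n\x01.\x15\x08\xf6d\xc0"
    b"\n\x08\n\x01s\x15\xe8\xa8\x8b\xc0"
    b"\n\x0b\n\x04\xe2\x96\x81a\x15\xaf \x9b\xc0"
    b"\n\x08\n\x01'\x15j\xe9\x9b\xc0"
    b"\n\r\n\x06\xe2\x96\x81th"
)


def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, new_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_piece(body):
    """Parse one piece sub-message into {field_number: value}."""
    fields, pos = {}, 0
    while pos < len(body):
        key, pos = read_varint(body, pos)
        field, wire = key >> 3, key & 7
        if wire == 2:    # length-delimited: the piece text (UTF-8)
            size, pos = read_varint(body, pos)
            fields[field] = body[pos:pos + size].decode("utf-8")
            pos += size
        elif wire == 5:  # fixed 32-bit: the score, a little-endian float
            fields[field] = struct.unpack("<f", body[pos:pos + 4])[0]
            pos += 4
        elif wire == 0:  # varint: the piece type enum
            fields[field], pos = read_varint(body, pos)
        else:
            raise ValueError(f"unexpected wire type {wire}")
    return fields


def iter_pieces(buf):
    """Yield each complete top-level record with tag 0x0a (field 1, bytes)."""
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        if key != 0x0A or pos >= len(buf):
            break
        size, pos = read_varint(buf, pos)
        if pos + size > len(buf):
            break  # record cut off by the end of the slice
        yield parse_piece(buf[pos:pos + size])
        pos += size


records = list(iter_pieces(chunk))
print([r[1] for r in records])
```

The last record in the slice is cut off mid-piece, so the loop stops before it. The leading 0x0a byte of every record is also the '\x0a' that pickle complains about as an invalid load key.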
