Keras Embedding prediction fails with "indices not in range"



I have a model trained with:

common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=len(embeddings_index['no']),
    weights=[embedding_matrix],
    input_length=len(X_train['asset_text_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_1 = common_embed(input_1)
common_lstm = LSTM(64, input_shape=(100, 2))
...

For the embedding I use GloVe as the pretrained embedding dictionary. I first build the tokenizer and the text sequences with:

t = Tokenizer()
t.fit_on_texts(all_text)

text_seq = pad_sequences(t.texts_to_sequences(data['example_texts'].astype(str).values))
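Note that by default pad_sequences pre-pads with zeros up to the longest sequence in the batch, so the padded length can differ between datasets unless maxlen is fixed. A minimal pure-Python sketch of that default behaviour (the pad helper is illustrative, not the Keras API):

```python
# Sketch of Keras-style pre-padding/pre-truncating to a fixed length.
def pad(seqs, maxlen):
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in seqs]

print(pad([[1, 2], [3, 4, 5, 6]], 3))  # [[0, 1, 2], [4, 5, 6]]
```

Passing an explicit maxlen at inference time keeps the input shape consistent with what input_length saw during training.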

Then I compute the embedding matrix with:

embeddings_index = {}
for line in new_byte_string.decode('utf-8').split('\n'):
    if line:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

not_present_list = []
vocab_size = len(t.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))
embedding_matrix = np.zeros((vocab_size, len(embeddings_index['no'])))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        not_present_list.append(word)  # row i stays all zeros

Now I run prediction on a new dataset, which raises this error:

Node: 'model/synopsis_embedd/embedding_lookup' indices[38666,63] = 136482 is not in [0, 129872) [[{{node model/synopsis_embedd/embedding_lookup}}]] [Op:__inference_predict_function_12452]

For prediction I run all the preprocessing steps again. Is that wrong? Do I have to reuse the tokenizer from training? Or why else would out-of-range indices show up during prediction?
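The error message can be confirmed by checking the new data's indices against the trained vocabulary size. A minimal NumPy check with toy values (vocab_size and text_seq here are stand-ins for the real Embedding input_dim and the padded prediction batch):

```python
import numpy as np

# Toy stand-ins: the vocab size the Embedding was built with,
# and a padded batch containing one out-of-range index.
vocab_size = 10
text_seq = np.array([[1, 4, 9], [2, 12, 0]])

# Any index >= vocab_size makes embedding_lookup fail.
out_of_range = text_seq[text_seq >= vocab_size]
print(out_of_range)  # [12]
```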

You are probably getting this error because you are not using the same tokenizer and embedding_matrix during inference. Here is an example:

import tensorflow as tf

vocab_size = 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, 64, input_length=10)

sequence1 = tf.constant([[1, 2, 5, 10, 32]])
embedding_layer(sequence1)  # This works: all indices are in [0, 50)

sequence2 = tf.constant([[51, 2, 5, 10, 32]])
embedding_layer(sequence2)  # Throws an error: 51 is outside [0, vocab_size)
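To avoid the mismatch, persist the fitted tokenizer after training and reload it at inference time instead of fitting a new one on the prediction data. A minimal sketch with pickle (a plain dict stands in for the fitted Tokenizer's vocabulary here; a real Keras Tokenizer object can be pickled the same way, and the filename is hypothetical):

```python
import os
import pickle
import tempfile

# Stand-in for the fitted Tokenizer's word_index from training.
word_index = {"the": 1, "cat": 2, "sat": 3}

path = os.path.join(tempfile.gettempdir(), "tokenizer.pkl")

# After training: persist the fitted vocabulary once.
with open(path, "wb") as f:
    pickle.dump(word_index, f)

# At inference: reload the SAME vocabulary instead of refitting,
# so every word keeps the index the Embedding layer was trained with.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_index)  # True
```

Refitting a Tokenizer on the prediction data assigns fresh indices based on that data's word frequencies, which is exactly how an index like 136482 can exceed the trained input_dim of 129872.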
