First row of the Keras word embedding matrix is all zeros

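I am looking at the Keras GloVe word embedding example, and it is not clear to me why the first row of the embedding matrix is filled with zeros.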

First, the embedding index is created, which maps each word to its coefficient array.

embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs
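
To make the result of this loop concrete, here is a minimal sketch that parses two lines in the same "word followed by coefficients" layout from an in-memory string rather than the real GloVe file. The sample words and values are invented for illustration, and `np.asarray` is used here in place of `np.fromstring`; that is my substitution, not what the Keras example does.

```python
import io

import numpy as np

# Two invented lines in the GloVe text layout: a word, then its coefficients.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

embeddings_index = {}
for line in sample:
    # Split off the word; the rest of the line holds the coefficients.
    word, coefs = line.split(maxsplit=1)
    embeddings_index[word] = np.asarray(coefs.split(), dtype="float32")

print(sorted(embeddings_index))
print(embeddings_index["cat"].shape)
```

Each key is a word and each value is a one-dimensional float array whose length equals the embedding dimension.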

Then the embedding matrix is created by looking up each word from the index built by the tokenizer.

# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

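Since the loop starts at i=1, the first row will contain only zeros, or random numbers if the matrix were initialized differently. Is there any reason for skipping the first row?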

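It all starts with the fact that the programmers of Tokenizer reserved index 0 for some reason, perhaps for compatibility (some other languages use 1-based indexing) or for reasons of coding technique.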
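
As a rough illustration of what a 1-based word index looks like, the sketch below ranks words by frequency and numbers them starting from 1, so that 0 is never assigned. `build_word_index` is a hypothetical helper written for this example; it only imitates the general idea and is not the Tokenizer implementation.

```python
from collections import Counter


def build_word_index(texts):
    """Hypothetical helper: rank words by frequency, numbering from 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count; start=1 leaves index 0 unused.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


word_index = build_word_index(["the cat sat", "the cat", "the"])
print(word_index)
```

With this toy input the most frequent word gets index 1, and no word ever maps to 0.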

However, they use numpy, and they want to use simple indexing:

embedding_matrix[i] = embedding_vector

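With that indexing, the row at index [0] stays full of zeros, and the "random numbers if the matrix is initialized differently" case never arises, because the array has been initialized with zeros. So the first row is not needed at all, but you cannot delete it, because the numpy array would then lose the alignment between its indices and the tokenizer's indices.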
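
The alignment point can be checked with a small sketch. The two-word `word_index` and the 3-dimensional vectors below are toy values assumed for illustration, not data from the Keras example.

```python
import numpy as np

EMBEDDING_DIM = 3
word_index = {"the": 1, "cat": 2}  # toy 1-based index; 0 is unused
embeddings_index = {
    "the": np.array([0.1, 0.2, 0.3]),
    "cat": np.array([0.4, 0.5, 0.6]),
}

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_matrix[i] = embeddings_index[word]

# Row 0 is never written by the loop, so it keeps its initial zeros.
row_zero_is_empty = not embedding_matrix[0].any()

# Dropping row 0 shifts every row up by one, so the index for "the"
# now lands on the vector that belongs to "cat".
trimmed = embedding_matrix[1:]
misaligned = np.allclose(trimmed[word_index["the"]], embeddings_index["cat"])

print(row_zero_is_empty, misaligned)
```

Keeping the unused zero row costs one row of memory and lets the tokenizer's indices be used directly as row numbers.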
