Keras model with fastText word embeddings

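I am trying to train a language model with Keras that predicts the last word of a sentence given all the preceding words. I want to embed my input using a pretrained fastText embedding model.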

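I managed to preprocess my text data and embed it with fastText. My training data consists of sentences of 40 tokens each. I created two NumPy arrays, X and y, where X is the input and y is what I want to predict.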

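X has shape (44317, 39, 300), where 44317 is the number of example sentences, 39 is the number of tokens in each sentence, and 300 is the dimension of the word embeddings.

y has shape (44317, 300): for each example, it is the embedding of the last token of the sentence.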

My Keras model code is as follows (inspired by this):

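# Import the needed tensorflow.keras components
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, LSTM, Dense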
model = Sequential()  
model.add(InputLayer((None, 300)))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(300, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=20)
model.save('model.h5')

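However, the accuracy I get when training this model is very low (about 1.5%). I think I am misusing some component of the Keras model, because if I do not embed my input myself and instead add an Embedding layer in place of the InputLayer, I get about 60% accuracy.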

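My main suspicion is the "300" in my second Dense layer, since I have read that it should correspond to the vocabulary size of my word embedding model (48000). However, if I put anything other than 300 there, I get a dimension error. So I know I am doing something wrong, but I cannot figure out how to fix it.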

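PS: I also tried y = to_categorical(y, num_classes=vocab_size), where vocab_size is the vocabulary size of my word embeddings, and changed the 300 in the second Dense layer to the same value. It then tries to create an array of shape (13295100, 48120) instead of the (44317, 48120) I expected.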
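The shape in the PS is consistent with how one-hot encoding treats its input: it expects one integer class index per example and appends a class axis, so when it is given the (44317, 300) embedding matrix, each of the 44317 × 300 = 13295100 values is treated as a class index. A small NumPy sketch of this at toy sizes, assuming a hypothetical helper `one_hot` that only mimics the behaviour (it is not the Keras implementation):

```python
import numpy as np

def one_hot(y, num_classes):
    """Toy stand-in for one-hot encoding: flatten, encode, restore the shape."""
    y = np.asarray(y, dtype="int64")
    flat = y.reshape(-1)                               # every value becomes a class index
    encoded = np.zeros((flat.shape[0], num_classes))   # (n_values, num_classes)
    encoded[np.arange(flat.shape[0]), flat] = 1.0
    return encoded.reshape(y.shape + (num_classes,))

num_examples, embedding_dim, vocab_size = 4, 3, 10     # toy sizes, not the real ones

# Targets stored as embeddings: one row of floats per example.
y_embedded = np.zeros((num_examples, embedding_dim))
# Targets stored as word indices: one integer per example.
y_indices = np.array([1, 7, 2, 9])

wrong = one_hot(y_embedded, vocab_size)   # intermediate array has 4 * 3 rows
right = one_hot(y_indices, vocab_size)    # one row per example
print(wrong.shape)
print(right.shape)
```

With integer word indices as targets, the encoded array has one row per example, which matches the expected (44317, 48120) layout.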

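If you really want to use the word vectors from fastText, you have to incorporate them into your model using a weight matrix and an Embedding layer. The goal of the Embedding layer is to map each integer sequence representing a sentence to its corresponding 300-dimensional vector representation: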

import gensim.downloader as api
import numpy as np
import tensorflow as tf

def load_doc(filename):
    # Read the whole file into a single string
    with open(filename, 'r') as file:
        return file.read()

# Pretrained fastText vectors (300 dimensions)
fasttext = api.load("fasttext-wiki-news-subwords-300")
embedding_dim = 300

in_filename = 'data.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# Encode every line as a sequence of integer word indices, padded to equal length
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(lines)
text_sequences = tokenizer.texts_to_sequences(lines)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
text_sequences = np.array(text_sequences)

# Input: all tokens but the last; target: index of the last token, one-hot encoded
X, y = text_sequences[:, :-1], text_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
max_length = X.shape[1]

# Row i of the weight matrix holds the fastText vector of the word with index i
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    try:
        embedding_vector = fasttext[word]
        weight_matrix[i] = embedding_vector
    except KeyError:
        # Word unknown to fastText: fall back to a random vector
        weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)

sentence_input = tf.keras.layers.Input(shape=(max_length,))
x = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[weight_matrix],
                              input_length=max_length)(sentence_input)
x = tf.keras.layers.LSTM(100, return_sequences=True)(x)
x = tf.keras.layers.LSTM(100)(x)
x = tf.keras.layers.Dense(100, activation='relu')(x)
# One output unit per vocabulary word, so softmax yields a distribution over words
output = tf.keras.layers.Dense(vocab_size, activation='softmax')(x)

model = tf.keras.Model(sentence_input, output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=5, epochs=20)

Note that I am using the dataset and preprocessing steps from the tutorial you linked.
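Conceptually, an Embedding layer initialised with a weight matrix performs a row lookup: each integer in a sequence selects the matching row of the matrix. A minimal NumPy sketch of that idea, using made-up toy values (the 4-dimensional vectors and the three-word vocabulary are assumptions for illustration, not fastText data):

```python
import numpy as np

embedding_dim = 4
word_index = {"the": 1, "cat": 2, "sat": 3}   # hypothetical tokenizer output; 0 is padding
vocab_size = len(word_index) + 1

# Row i holds the vector for the word with index i; row 0 stays zero for padding.
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    weight_matrix[i] = np.full(embedding_dim, float(i))   # stand-in for fasttext[word]

# Two padded sentences encoded as integer sequences.
sequences = np.array([[1, 2, 3],
                      [1, 3, 0]])

# Integer-array indexing selects one row per token, like the layer's forward lookup.
embedded = weight_matrix[sequences]
print(embedded.shape)   # (sentences, tokens per sentence, embedding_dim)
```

This is why the model input can stay as integer sequences of shape (num_sentences, max_length) while the LSTM still receives one vector per token.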

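Training an RNN model on a next-sentence prediction task is very difficult. LSTM/GRU layers do not have enough capacity to extract sufficient features from the text.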

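There are two ways to address the problem:

Predict characters instead of word classes
Use a transformer model. For example, BERT is good at feature extraction and masked-word prediction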
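As an illustration of the first option: with characters as targets, the number of classes drops from tens of thousands of words to the handful of distinct symbols in the text. The sketch below only covers the data preparation, and assumes a hypothetical helper `make_char_examples` and a made-up sample string; it is not taken from the answer.

```python
import numpy as np

def make_char_examples(text, window):
    """Slide a fixed window over the text; the character after it is the target."""
    chars = sorted(set(text))
    char_index = {c: i for i, c in enumerate(chars)}
    encoded = [char_index[c] for c in text]
    n_examples = len(encoded) - window
    X = np.array([encoded[i:i + window] for i in range(n_examples)])
    y = np.array([encoded[i + window] for i in range(n_examples)])
    return X, y, char_index

X, y, char_index = make_char_examples("hello world", window=4)
print(X.shape, y.shape, len(char_index))
```

Here X could feed an Embedding layer with one row per character, and the integer targets in y could be used with the sparse_categorical_crossentropy loss so that no one-hot matrix is needed.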
