Concatenation of LSTM outputs



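I am trying to build a multi-task image captioning model made of two separate encoder-decoder models, each taking input from a different dataset. The outputs of the two LSTMs are combined by concatenation, and the output of the concatenation layer is passed to the Dense layers. Here is the model code: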

from tensorflow.keras.layers import (Input, Dropout, Dense, RepeatVector,
                                     Embedding, LSTM, concatenate)
from tensorflow.keras.models import Model

EMBEDDING_DIM = 256  # matches the (None, 256) Dense outputs in the summary below

def define_model(vocab_size1, max_length1, vocab_size2, max_length2):
    # first branch: image features + partial caption from dataset 1
    inputs1 = Input(shape=(4096,))
    print(inputs1.shape)
    fe1_1 = Dropout(0.5)(inputs1)
    fe2_1 = Dense(EMBEDDING_DIM, activation='relu')(fe1_1)
    fe3_1 = RepeatVector(max_length1)(fe2_1)
    inputs2 = Input(shape=(max_length1,))
    print(inputs2.shape)
    emb2_1 = Embedding(vocab_size1, EMBEDDING_DIM, mask_zero=True)(inputs2)

    merged1 = concatenate([fe3_1, emb2_1], name='concat1')
    lm2_1 = LSTM(500, return_sequences=False)(merged1)

    # second branch: image features + partial caption from dataset 2
    inputs3 = Input(shape=(4096,))
    fe1_2 = Dropout(0.5)(inputs3)
    fe2_2 = Dense(EMBEDDING_DIM, activation='relu')(fe1_2)
    fe3_2 = RepeatVector(max_length2)(fe2_2)

    inputs4 = Input(shape=(max_length2,))
    emb2_2 = Embedding(vocab_size2, EMBEDDING_DIM, mask_zero=True)(inputs4)

    merged2 = concatenate([fe3_2, emb2_2], name='concat2')
    lm2_2 = LSTM(500, return_sequences=False)(merged2)

    # merge the two LSTM outputs
    merged3 = concatenate([lm2_1, lm2_2], name='concat3')  # error
    outputs = Dense(vocab_size1, activation='softmax')(merged3)
    outputs1 = Dense(vocab_size2, activation='softmax')(merged3)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2, inputs3, inputs4], outputs=[outputs, outputs1])
    model.compile(loss=['categorical_crossentropy', 'categorical_crossentropy'], optimizer='adam', metrics=['accuracy'])
    model.summary()
    # plot_model(model, show_shapes=True, to_file='model.png')
    return model

I can initialize it correctly:

model = define_model(fvocab_size, fmax_length, wvocab_size, wmax_length)
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 4096)]       0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 4096)]       0                                            
__________________________________________________________________________________________________
dropout (Dropout)               (None, 4096)         0           input_1[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 4096)         0           input_3[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          1048832     dropout[0][0]                    
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 34)]         0                                            
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 256)          1048832     dropout_1[0][0]                  
__________________________________________________________________________________________________
input_4 (InputLayer)            [(None, 21)]         0                                            
__________________________________________________________________________________________________
repeat_vector (RepeatVector)    (None, 34, 256)      0           dense[0][0]                      
__________________________________________________________________________________________________
embedding (Embedding)           (None, 34, 256)      1940224     input_2[0][0]                    
__________________________________________________________________________________________________
repeat_vector_1 (RepeatVector)  (None, 21, 256)      0           dense_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 21, 256)      1428992     input_4[0][0]                    
__________________________________________________________________________________________________
concat1 (Concatenate)           (None, 34, 512)      0           repeat_vector[0][0]
                                                                 embedding[0][0]
__________________________________________________________________________________________________
concat2 (Concatenate)           (None, 21, 512)      0           repeat_vector_1[0][0]
                                                                 embedding_1[0][0]
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 500)          2026000     concat1[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 500)          2026000     concat2[0][0]                    
__________________________________________________________________________________________________
concat3 (Concatenate)           (None, 1000)         0           lstm[0][0]
                                                                 lstm_1[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 7579)         7586579     concat3[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 5582)         5587582     concat3[0][0]                    
==================================================================================================
Total params: 22,693,041
Trainable params: 22,693,041
Non-trainable params: 0

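The inputs to the concatenation have shapes (None, 500) and (None, 500), and the output is (None, 1000). However, when I pass actual data through the generator, I get an error: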

`InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-15-e52b85d1307b> in <module>()
12 
13 model.fit(train_generator, epochs=20,  verbose=1, steps_per_epoch=steps, validation_steps=val_steps,
---> 14     callbacks=[checkpoint], validation_data=val_generator)
15 
16 try:
6 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1098                 _r=1):
1099               callbacks.on_train_batch_begin(step)
-> 1100               tmp_logs = self.train_function(iterator)
1101               if data_handler.should_sync:
1102                 context.async_wait()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
826     tracing_count = self.experimental_get_tracing_count()
827     with trace.Trace(self._name) as tm:
--> 828       result = self._call(*args, **kwds)
829       compiler = "xla" if self._experimental_compile else "nonXla"
830       new_tracing_count = self.experimental_get_tracing_count()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
886         # Lifting succeeded, so variables are initialized and we can run the
887         # stateless function.
--> 888         return self._stateless_fn(*args, **kwds)
889     else:
890       _, _, _, filtered_flat_args = 
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
2941        filtered_flat_args) = self._maybe_define_function(args, kwargs)
2942     return graph_function._call_flat(
-> 2943         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
2944 
2945   @property
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1917       # No tape is watching; skip to running the function.
1918       return self._build_call_outputs(self._inference_function.call(
-> 1919           ctx, args, cancellation_manager=cancellation_manager))
1920     forward_backward = self._select_forward_and_backward_functions(
1921         args,
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
558               inputs=args,
559               attrs=attrs,
--> 560               ctx=ctx)
561         else:
562           outputs = execute.execute_with_cancellation(
/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58     ctx.ensure_initialized()
59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
61   except core._NotOkStatusException as e:
62     if name is not None:
InvalidArgumentError:  All dimensions except 1 must match. Input 1 has shape [4 500] and doesn't match input 0 with shape [47 500].
[[node gradient_tape/model/concat3/ConcatOffset (defined at <ipython-input-15-e52b85d1307b>:14) ]] [Op:__inference_train_function_14543]
Function call stack:
train_function`

Generator code:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def create_sequences(tokenizer, max_length, desc_list, photo):
    vocab_size = len(tokenizer.word_index) + 1
    X1, X2, y = [], [], []
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(photo)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

def double_generator(descriptions1, photos1, tokenizer1, max_length1,
                     descriptions2, photos2, tokenizer2, max_length2, n_step=1):
    while True:
        # loop over photo identifiers in the dataset
        keys1 = list(descriptions1.keys())
        keys2 = list(descriptions2.keys())    # len(keys1) == len(keys2)
        for i in range(0, len(keys1), n_step):
            Ximages1, XSeq1, y1 = list(), list(), list()
            Ximages2, XSeq2, y2 = list(), list(), list()
            for j in range(i, min(len(keys1), i+n_step)):
                image_id1 = keys1[j]
                # retrieve the photo feature
                photo1 = photos1[image_id1][0]
                desc_list1 = descriptions1[image_id1]
                # print(desc_list)
                in_img1, in_seq1, out_word1 = create_sequences(tokenizer1, max_length1, desc_list1, photo1)
                # print(in_img, in_seq, out_word)
                for k in range(len(in_img1)):
                    Ximages1.append(in_img1[k])
                    XSeq1.append(in_seq1[k])
                    y1.append(out_word1[k])
            # print('Ximages1', Ximages1)
            # print('Xseq1', XSeq1)
            # print('y1', y1)
            for j in range(i, min(len(keys2), i+n_step)):
                image_id2 = keys2[j]
                # retrieve the photo feature
                photo2 = photos2[image_id2][0]
                desc_list2 = descriptions2[image_id2]
                # print(desc_list)
                in_img2, in_seq2, out_word2 = create_sequences(tokenizer2, max_length2, desc_list2, photo2)
                # print(in_img, in_seq, out_word)
                for k in range(len(in_img2)):
                    Ximages2.append(in_img2[k])
                    XSeq2.append(in_seq2[k])
                    y2.append(out_word2[k])
            # print('Ximages2', Ximages2)
            # print('Xseq2', XSeq2)
            # print('y2', y2)
            # note: the number of samples for dataset 1 and dataset 2 is not forced to be equal here
            yield ([np.array(Ximages1), np.array(XSeq1), np.array(Ximages2), np.array(XSeq2)], [np.array(y1), np.array(y2)])

When there is only one dataset and no concatenation of LSTMs (plain image captioning), everything works fine.

The input shapes reported in the error change each time I call next(generator), and also when I change the description lengths, even though I use padding.

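The Keras guide on the functional API contains an example similar to mine, in the section "Manipulate complex graph topologies": https://keras.io/guides/functional_api/ That example also concatenates LSTM outputs, and I don't understand why the same thing does not work in my case without any reshaping.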

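I have tried:

  • changing concatenate to layers.Concatenate,
  • changing mask_zero=True to False in the Embedding layers,
  • creating a common tokenizer for the descriptions in both datasets,
  • changing the concatenation axis to 0 (then there is a problem with the logits).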

Thanks in advance.

TLDR;

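Your generator is sending 47 samples to one pair of inputs and 4 samples to the other pair in the same batch. The model builds without complaint because the first dimension of every input is None, which accepts any batch size. However, when the tensors from the 2 LSTMs, of shape (47, 500) and (4, 500), reach the concatenate layer, the layer cannot join them: it concatenates along the last axis and requires every other dimension, including the batch dimension, to match. That is why you get the error at training time rather than at compile time.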

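If you are trying to produce a single sample (1 row of data) at a time through the generator, then what you currently have are 2D inputs of shape (47, 4096) and (4, 4096). In that case they should be reshaped to (1, 47, 4096) and (1, 4, 4096). This will completely change your architecture, but it is consistent with what I think you are trying to do.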
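The reshaping mentioned above can be sketched with NumPy. This is only an illustration: the zero-filled arrays `block1` and `block2` are hypothetical stand-ins for the generator output, using the shapes from the error message.

```python
import numpy as np

# Hypothetical stand-ins for what the generator yields for one image
# from each dataset (shapes taken from the error message).
block1 = np.zeros((47, 4096), dtype=np.float32)
block2 = np.zeros((4, 4096), dtype=np.float32)

# Add a leading batch axis so that each block becomes exactly one sample.
sample1 = np.expand_dims(block1, axis=0)
sample2 = np.expand_dims(block2, axis=0)

print(sample1.shape, sample2.shape)  # (1, 47, 4096) (1, 4, 4096)
```

Both arrays now have a batch size of 1, so their first dimensions agree, but every downstream layer would have to accept the extra time-like axis.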


Details -

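The problem is that you are passing batches of different sizes to the model's inputs. Nothing stops you at the input layers, because the first dimension, None, accepts any batch size.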
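Why the two batch sizes differ can be seen from how the question's create_sequences splits captions: each tokenized caption of length n yields n - 1 (prefix, next word) pairs. The sketch below only counts those pairs; `count_pairs` and the token lists are hypothetical and chosen for illustration.

```python
def count_pairs(token_sequences):
    """Count the (prefix, next word) pairs produced from tokenized captions."""
    # Mirrors the inner loop `for i in range(1, len(seq))` in create_sequences:
    # a caption with n tokens contributes n - 1 training samples.
    return sum(max(len(seq) - 1, 0) for seq in token_sequences)

# Hypothetical tokenized captions for one image from each dataset.
captions_image1 = [[2, 5, 9, 4, 7, 3], [2, 8, 6, 3]]  # 5 + 3 pairs
captions_image2 = [[2, 11, 3]]                        # 2 pairs

print(count_pairs(captions_image1))  # 8
print(count_pairs(captions_image2))  # 2
```

Padding fixes the length of each input sequence, not the number of samples, so the two halves of a batch end up with different first dimensions whenever the captions differ in number or length.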

Let's walk through what happens in the model, step by step, for just 2 of the inputs (Ximages1 and Ximages2).

On your first forward pass (and for every batch from the generator)

Input layers -

input_1 (InputLayer) [(None, 4096)] #(47, 4096) Ximages1
input_3 (InputLayer) [(None, 4096)] #(4, 4096)  Ximages2

They pass through the intermediate layers until they reach their respective LSTMs.

LSTM layers -

lstm (LSTM)   (None, 500) concat1[0][0] #(47, 500)              
lstm_1 (LSTM) (None, 500) concat2[0][0] #(4, 500)

Now the next layer, the concatenation, tries to merge the 2 layers into one, as -

concat3 (Concatenate) (None, 1000) lstm[0][0]   #(47, 500)
                                   lstm_1[0][0] #(4, 500)

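Architecturally, it can concatenate one (None, 500) with a second (None, 500) along the feature axis to give (None, 1000). That, however, assumes both layers receive the same number of samples in each batch.

In other words, (47, 500) and (4, 500) cannot be concatenated along axis 1, because their first dimensions (47 and 4) differ.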
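The same constraint can be reproduced outside Keras. This is a small NumPy sketch, not the TensorFlow op itself; the zero-filled arrays are hypothetical stand-ins for the two LSTM outputs.

```python
import numpy as np

# Stand-ins for the two LSTM outputs in the failing batch.
lstm_out1 = np.zeros((47, 500))
lstm_out2 = np.zeros((4, 500))

try:
    # Joining along axis 1 requires axis 0 (the batch axis) to match.
    np.concatenate([lstm_out1, lstm_out2], axis=1)
    failed = False
except ValueError as err:
    failed = True
    print("concatenation failed:", err)

# With equal batch sizes the same call succeeds.
same_batch = np.concatenate([lstm_out1, np.zeros((47, 500))], axis=1)
print(same_batch.shape)  # (47, 1000)
```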

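  • You may need to rethink how your generator builds its output batches.
  • If (47, 4096) and (4, 4096) are each supposed to be a single sample, you may want to output them as 3D tensors instead of 2D: (1, 47, 4096) and (1, 4, 4096).
  • That way, the input layers would take (None, 47, 4096) and (None, 4, 4096).
  • This changes every subsequent layer accordingly, because you now have to work with an extra dimension.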
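If you keep the 2D inputs instead, one possible way to rethink the generator is to make both halves of each batch the same size before yielding. The helper below, `align_batches`, is a hypothetical sketch and not part of Keras. It simply trims to the smaller count, which discards samples, so treat it as a starting point rather than a recommended fix.

```python
import numpy as np

def align_batches(inputs1, inputs2):
    """Trim two groups of arrays so that all share the same number of samples."""
    n = min(len(inputs1[0]), len(inputs2[0]))
    trimmed1 = [arr[:n] for arr in inputs1]
    trimmed2 = [arr[:n] for arr in inputs2]
    return trimmed1, trimmed2

# Hypothetical batch with the sizes from the error message:
# [image features, padded sequences, one-hot targets] per dataset.
group1 = [np.zeros((47, 4096)), np.zeros((47, 34)), np.zeros((47, 7579))]
group2 = [np.zeros((4, 4096)), np.zeros((4, 21)), np.zeros((4, 5582))]

group1, group2 = align_batches(group1, group2)
print([arr.shape[0] for arr in group1 + group2])  # [4, 4, 4, 4, 4, 4]
```

In double_generator this would be applied just before the yield. Note that it pairs sample k of dataset 1 with sample k of dataset 2 arbitrarily, which may or may not make sense for your task.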
