How to print the maximum memory used during Keras's model.fit()



I wrote a neural network model with Keras/TensorFlow and am able to train and run it. Now I would like to know how much memory training the model requires. How can I print this information during the training phase? I tried the Keras model profiler below, but it does not account for the peak memory needed during training. For example, training my model runs out of memory on a 6 GB GPU card, yet the profile reports a memory requirement below 1 GB. So how can I measure the peak runtime memory requirement when I call model.fit() in Keras?

https://github.com/Mr-TalhaIlyas/Tensorflow-Keras-Model-Profiler

I would suggest using a Keras Callback and printing the GPU usage after each epoch. You can get the GPU info with tf.config.experimental.get_memory_info('GPU:0'). Here is a working example:

import tensorflow as tf

class MemoryPrintingCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Query how much GPU memory TensorFlow is actually using right now
        gpu_dict = tf.config.experimental.get_memory_info('GPU:0')
        tf.print('\n GPU memory details [current: {} gb, peak: {} gb]'.format(
            float(gpu_dict['current']) / (1024 ** 3),
            float(gpu_dict['peak']) / (1024 ** 3)))

inputs = tf.keras.layers.Input((1000,))
x = tf.keras.layers.Dense(1000, 'relu')(inputs)
x = tf.keras.layers.Dense(1000, 'relu')(x)
x = tf.keras.layers.Dense(1000, 'relu')(x)
x = tf.keras.layers.Dense(1000, 'relu')(x)
outputs = tf.keras.layers.Dense(1, 'sigmoid')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy())
x = tf.random.normal((500, 1000))
y = tf.random.uniform((500, 1), maxval=2, dtype=tf.int32)
model.fit(x, y, batch_size=50, epochs=20, callbacks=[MemoryPrintingCallback()])
GPU memory details [current: 0.321030855178833 gb, peak: 0.32660841941833496 gb]
Epoch 1/20
10/10 [==============================] - 1s 8ms/step - loss: 0.9309
GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 2/20
10/10 [==============================] - 0s 7ms/step - loss: 0.5702
GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 3/20
10/10 [==============================] - 0s 8ms/step - loss: 0.1311
GPU memory details [current: 0.3508758544921875 gb, peak: 0.3557243347167969 gb]
Epoch 4/20
10/10 [==============================] - 0s 7ms/step - loss: 0.0865
GPU memory details [current: 0.3508758544921875 gb, peak: 0.3661658763885498 gb]
Epoch 5/20
10/10 [==============================] - 0s 7ms/step - loss: 0.0379
...
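If you only need a single number for the whole run rather than a print per epoch, the same get_memory_info call can be wrapped in a callback that tracks the largest peak it has seen and reports it once training finishes. This is just a sketch along the same lines; the PeakMemoryCallback name and the 'GPU:0' device string are illustrative, not part of any library:

import tensorflow as tf

class PeakMemoryCallback(tf.keras.callbacks.Callback):
    # Records the largest GPU memory peak observed over all epochs.
    def __init__(self, device='GPU:0'):
        super().__init__()
        self.device = device
        self.max_peak = 0

    def on_epoch_end(self, epoch, logs=None):
        # 'peak' is the highest usage TensorFlow has recorded so far on this device
        peak = tf.config.experimental.get_memory_info(self.device)['peak']
        self.max_peak = max(self.max_peak, peak)

    def on_train_end(self, logs=None):
        print('\nPeak GPU memory during fit(): {:.3f} GB'.format(
            self.max_peak / (1024 ** 3)))

You would pass it to model.fit() exactly like the callback above, e.g. callbacks=[PeakMemoryCallback()].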

You can find your device name with

print(tf.config.list_physical_devices('GPU'))
#[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

But note the following:

For GPUs, TensorFlow will allocate all of the memory by default, unless this is changed with tf.config.experimental.set_memory_growth. The dict reports only the current and peak memory that TensorFlow is actually using, not the memory that TensorFlow has allocated on the GPU. (Source)
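If you would rather have TensorFlow reserve GPU memory on demand, so that the amount allocated on the card stays closer to the usage reported by get_memory_info, you can enable memory growth before building the model. A minimal sketch (it must run before any GPU has been initialized):

import tensorflow as tf

# Ask TensorFlow to grow its GPU memory allocation as needed instead of
# reserving all of it up front; must be called before the GPUs are initialized.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)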
