hub.KerasLayer() always occupies the same GPU memory although max_seq_len changes



I am using BERT from TensorFlow Hub, and I wanted to save GPU memory by reducing the model's max_seq_len after noticing this advice in the original BERT repository:

max_seq_length: The released models were trained with sequence lengths up to 512, but you can fine-tune with a shorter max sequence length to save substantial memory. This is controlled by the max_seq_length flag in our example code.
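For context, this is roughly what a shorter max sequence length means on the input side: token id lists get truncated and right-padded to a fixed length before being fed to the model. This is a minimal sketch; pad_to_max_seq_len and the assumption that 0 is the [PAD] id are my own illustration, not code from the BERT repository.

def pad_to_max_seq_len(token_ids, max_seq_len):
    # Truncate sequences longer than max_seq_len, then right-pad with 0 ([PAD]).
    ids = token_ids[:max_seq_len]
    mask = [1] * len(ids)                # 1 marks real tokens, 0 marks padding
    pad = max_seq_len - len(ids)
    return ids + [0] * pad, mask + [0] * pad

ids, mask = pad_to_max_seq_len(list(range(100)), 64)  # a 100-token input with a 64-token budget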

But in my tests, the BERT model always consumes the same amount of GPU memory even though max_seq_len changes. Here is my test script:

import numpy as np
import tensorflow_hub as hub
import tensorflow as tf

num_sample = 1000
batch_size = 10
max_seq_len = 512
num_class = 30
vocab_num = 30000
epochs = 100
learning_rate = 1e-5

# get the pooled_output of Bert and pass it to a dense layer
def bert_model():
    input_ids = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_ids')
    input_masks = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_masks')
    input_segments = tf.keras.Input((max_seq_len,), dtype=tf.int32, name='input_segments')
    bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                                trainable=True)
    pooled_output, sequence_output = bert_layer([input_ids, input_masks, input_segments])
    out = tf.keras.layers.Dense(num_class, activation="sigmoid", name="dense_output")(pooled_output)
    model = tf.keras.models.Model(inputs=[input_ids, input_masks, input_segments], outputs=out)
    return model

outputs = np.random.randn(num_sample, num_class)
inputs = [np.random.randint(vocab_num, size=(num_sample, max_seq_len), dtype=np.int32),  # ids
          np.ones((num_sample, max_seq_len), dtype=np.int32),   # masks
          np.zeros((num_sample, max_seq_len), dtype=np.int32)]  # segments

model = bert_model()
print(model.summary())
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(loss='binary_crossentropy', optimizer=optimizer)  # multi-label task
model.fit(inputs, outputs, epochs=epochs, verbose=1, batch_size=batch_size)

When max_seq_len is 512 and I run the script on GPU 1 by typing CUDA_VISIBLE_DEVICES=1 python bert_test.py, I get the following result:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_ids (InputLayer)          [(None, 512)]        0
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 512)]        0
__________________________________________________________________________________________________
input_segments (InputLayer)     [(None, 512)]        0
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]
input_masks[0][0]
input_segments[0][0]
__________________________________________________________________________________________________
dense_output (Dense)            (None, 30)           23070       keras_layer[0][0]
==================================================================================================
Total params: 109,505,311
Trainable params: 109,505,310
Non-trainable params: 1
__________________________________________________________________________________________________
None
Train on 1000 samples
Epoch 1/100
2019-12-26 08:54:44.071737: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:54:45.962313: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:54:57.818644: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
900/1000 [==========================>...] - ETA: 8s - loss: 0.2933

The command nvidia-smi tells me that the process occupies 10765MiB of GPU 1:

Every 0.5s: nvidia-smi                                                                                                                                                          Thu Dec 26 08:56:22 2019
Thu Dec 26 08:56:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 46%   77C    P2    82W / 250W |  10895MiB / 11178MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 58%   86C    P2   195W / 250W |  10765MiB / 11178MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 88%   86C    P2   150W / 250W |   5930MiB / 11178MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   38C    P8     9W / 250W |    805MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25551      C   python                                     10885MiB |
|    1     24838      C   python                                     10755MiB |
|    2      8663      C   python                                       395MiB |
|    2     28173      C   python                                      5525MiB |
|    3     15501      C   python                                       795MiB |
+-----------------------------------------------------------------------------+

Then, no matter what max_seq_len I use, I get the same result: the GPU memory usage stays unchanged. For example, here is the output when I use max_seq_len=64.

Model summary and training info:


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_ids (InputLayer)          [(None, 64)]         0
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 64)]         0
__________________________________________________________________________________________________
input_segments (InputLayer)     [(None, 64)]         0
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 109482241   input_ids[0][0]
input_masks[0][0]
input_segments[0][0]
__________________________________________________________________________________________________
dense_output (Dense)            (None, 30)           23070       keras_layer[0][0]
==================================================================================================
Total params: 109,505,311
Trainable params: 109,505,310
Non-trainable params: 1
__________________________________________________________________________________________________
None
Train on 1000 samples
Epoch 1/100
2019-12-26 08:58:01.458129: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:58:03.176888: W tensorflow/core/common_runtime/shape_refiner.cc:89] Function instantiation has undefined input shape at index: 1211 in the outer inference context.
2019-12-26 08:58:14.005948: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
1000/1000 [==============================] - 29s 29ms/sample - loss: 0.3040
Epoch 2/100
280/1000 [=======>......................] - ETA: 6s - loss: 0.1366

And the GPU usage info:

Every 0.5s: nvidia-smi                                                                                                                                                          Thu Dec 26 08:59:10 2019
Thu Dec 26 08:59:10 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 46%   78C    P2   277W / 250W |  10895MiB / 11178MiB |     36%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
| 75%   86C    P2   222W / 250W |  10765MiB / 11178MiB |     93%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:82:00.0 Off |                  N/A |
| 88%   88C    P2   175W / 250W |   5930MiB / 11178MiB |     96%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
| 23%   39C    P8     9W / 250W |    805MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25551      C   python                                     10885MiB |
|    1     29332      C   python                                     10755MiB |
|    2      8663      C   python                                       395MiB |
|    2     28173      C   python                                      5525MiB |
|    3     15501      C   python                                       795MiB |
+-----------------------------------------------------------------------------+

Training is indeed faster with a smaller max_seq_len, but I care more about the memory usage. So could anyone help me with this problem? Any advice would be appreciated!

I used the code from the TensorFlow documentation and solved the problem. By default, TensorFlow maps nearly all of the GPU memory of every visible device when the process starts, which is why nvidia-smi reports the same footprint no matter what max_seq_len is. Enabling memory growth makes TensorFlow allocate GPU memory only as the runtime actually needs it:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
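In TF 2.x you can also get the same effect without code changes by setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true before launching the script. Alternatively, if you would rather cap the memory up front instead of letting it grow, the same tf.config.experimental API can pin a hard limit. This is a minimal sketch; the 4096MiB value is an arbitrary example, not a value the docs prescribe for BERT:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Restrict TensorFlow to a fixed 4096MiB slice of the first GPU's memory.
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)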
