TensorFlow: "Memory growth cannot differ between GPU devices" | How to use multiple GPUs with TensorFlow



I am trying to run Keras code on a GPU node of a cluster. Each GPU node has 4 GPUs, and I have verified that all 4 GPUs on my node are available to me. I run the code below to make TensorFlow use the GPUs:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)

The output lists the 4 available GPUs. However, when I run my code I get the following error:

Traceback (most recent call last):
File "/BayesOptimization.py", line 20, in <module>
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
return context.context().list_logical_devices(device_type=device_type)
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
self.ensure_initialized()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
config_str = self.config.SerializeToString()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
gpu_options = self._compute_gpu_options()
File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices

Shouldn't the code list all available GPUs and set memory growth to true for each of them?

I am currently using the following TensorFlow packages with Python 3.9.7:

tensorflow                2.4.1           gpu_py39h8236f22_0
tensorflow-base           2.4.1           gpu_py39h29c2da4_0
tensorflow-estimator      2.4.1              pyheb71bc4_0
tensorflow-gpu            2.4.1                h30adc30_0

Any idea what the problem is and how to solve it? Thanks in advance!

Just try `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` instead of `tf.config.experimental.set_memory_growth`. That worked for me.
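A minimal sketch of this workaround (the device index `"0"` is just an example — it restricts TensorFlow to a single GPU, which sidesteps the multi-GPU memory-growth conflict at the cost of only using one device; the variable must be set before TensorFlow is imported):

```python
import os

# CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported;
# once the CUDA runtime has enumerated the devices, changing it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0 (example index)

# import tensorflow as tf  # import TensorFlow only after the variable is set
```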

With multiple GPU devices, the memory-growth setting must be the same for all visible GPUs: either set it to true for every GPU, or leave it false for every GPU.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

TensorFlow GPU documentation
