I have one PC with 2 GPUs, and I want to train 2 independent CNNs, one on each GPU. I build the graph for each GPU like this:

    with tf.device('/gpu:%d' % self.single_gpu):
        self._create_placeholders()
        self._build_conv_net()
        self._create_cost()
        self._creat_optimizer()

The training loop is not inside the tf.device() context.
After starting the first CNN's training process, e.g. on GPU 1, I then start the second CNN's training on GPU 0. I always get a CUDA_ERROR_OUT_OF_MEMORY error and cannot start the second training process.

Is it possible to run 2 separate training tasks assigned to the 2 GPUs on the same PC? If it is, what am I missing?
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 164.06M (172032000 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******__
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 384.00MiB. See logs for memory state.
Traceback (most recent call last):
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
    status, run_metadata)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
  [[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
  [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "mg_model_nvidia_gpu.py", line 491, in <module>
main()
File "mg_model_nvidia_gpu.py", line 482, in main
nvidia_cnn.train(data_generator, train_data, val_data)
File "mg_model_nvidia_gpu.py", line 307, in train
self.keep_prob: self.train_config.keep_prob})
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/hl/anaconda3/envs/dl-conda-py36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
[[Node: _recv_inputs/input_placeholder_0/_7 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:2", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_3__recv_inputs/input_placeholder_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:2"]()]]
[[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:2", send_device_incarnation=1, tensor_name="edge_414_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
By default, TensorFlow pre-allocates the entire memory of every GPU device it has access to, so no memory is left for the second process.

You can control this allocation with config.gpu_options:
    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4
    sess = tf.Session(config=config)
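A related option in the same TF 1.x API, if you would rather not pick a fixed fraction, is to let TensorFlow grow its allocation on demand (a minimal sketch of the config fragment):

```python
import tensorflow as tf

# Allocate GPU memory as needed instead of reserving it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```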
Alternatively, you can use os.environ["CUDA_VISIBLE_DEVICES"] to assign a different card to each of your two processes.
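For example, each training script could mask the GPUs before TensorFlow is imported (a minimal sketch; the GPU index is just an example). CUDA renumbers the visible devices, so inside each process the selected card then appears as '/gpu:0':

```python
import os

# Make only the second physical GPU visible to this process.
# This must be set before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# import tensorflow as tf  # imported only after the mask is in place
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 1
```

The other process would set "0" instead, and neither process would ever see, or pre-allocate memory on, the other's GPU.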