TensorFlow 2 on a g4dn.xlarge GPU crashes after 8 epochs



I am trying to train a cGAN on a g4dn.xlarge GPU EC2 instance. Every time, it crashes after 8 epochs with the following message:

Traceback (most recent call last):
File "pix2pix_tf2.py", line 841, in <module>
main()
File "pix2pix_tf2.py", line 802, in main
results = sess.run(fetches, options=options, run_metadata=run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
(1) Invalid argument: 2 root error(s) found.
(0) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
(1) Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string
0 successful operations.
0 derived errors ignored.
[[{{node TensorArrayV2Write/TensorListSetItem}}]]
[[Func/encode_images/target_pngs/while/body/_47/input/_154/_773]]
0 successful operations.
0 derived errors ignored.

Environment: TensorFlow 2.2.0, CUDA V10.0.130, cuDNN 7.6.5

Updating CUDA to 10.1 solved the problem.
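
For context, the official tested build configurations list CUDA 10.1 for the prebuilt TensorFlow 2.1 to 2.3 GPU packages, so the CUDA 10.0 install in the question did not match TensorFlow 2.2.0. Below is a minimal sketch of a check for this kind of mismatch. The table `TESTED_CUDA` and the helpers `parse_cuda_release` and `cuda_matches` are hypothetical names written for this illustration and are not part of TensorFlow; the table covers only a few releases.

```python
import re

# Assumption: a hand-copied subset of the tested build configurations
# for the official TensorFlow GPU packages (TF major.minor -> CUDA major.minor).
TESTED_CUDA = {"2.0": "10.0", "2.1": "10.1", "2.2": "10.1", "2.3": "10.1"}


def parse_cuda_release(version_string):
    """Extract 'major.minor' from a string such as 'V10.0.130'."""
    match = re.search(r"V?(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("no CUDA version found in: " + repr(version_string))
    return match.group(1) + "." + match.group(2)


def cuda_matches(tf_version, cuda_version_string):
    """Return True/False if the TF release is in the table, else None."""
    tf_key = ".".join(tf_version.split(".")[:2])
    expected = TESTED_CUDA.get(tf_key)
    if expected is None:
        return None  # release not covered by this small table
    return parse_cuda_release(cuda_version_string) == expected


# The environment from the question: TensorFlow 2.2.0 with CUDA V10.0.130.
print(cuda_matches("2.2.0", "V10.0.130"))
```

The version string can be taken from the last line of `nvcc --version`, and the TensorFlow version from `tf.__version__`. A `False` result means the installed CUDA toolkit is not the one the package was tested against.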
