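I deployed a VM from the Deep Learning VM image with a Tesla A100 GPU, TensorFlow Enterprise 2.5, and CUDA 11.0. However, I cannot access the GPU/CUDA and get the following error.
E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
During deployment I received the following warning: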
tensorflow has resource level warnings. The resource 'projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210619-debian-10' is deprecated. A suggested replacement is 'projects/click-to-deploy-images/global/images/tf-2-5-cu110-v20210624-debian-10'.
This is an existing image published by Google that many people use, so why can't I access the GPU or CUDA with it?
import tensorflow as tf
2021-07-05 17:05:14.901743: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
tf.__version__
'2.5.0'
print(tf.config.list_physical_devices())
2021-07-05 17:05:44.757638: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-05 17:05:44.840142: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2021-07-05 17:05:44.840245: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: deeplearning-1-vm
2021-07-05 17:05:44.840258: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: deeplearning-1-vm
2021-07-05 17:05:44.841760: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 450.80.2
2021-07-05 17:05:44.841820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 450.80.2
2021-07-05 17:05:44.841833: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 450.80.2
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
The following details may help in diagnosing the problem.
a_k@deeplearning-1-vm:~$ nvidia-smi
Mon Jul 5 17:03:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-SXM4-40GB Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 56W / 400W | 0MiB / 40537MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
a_k@deeplearning-1-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
a_k@deeplearning-1-vm:~$ cat /usr/local/cuda/version.txt
CUDA Version 11.0.207
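The problem is that the NVIDIA driver, CUDA, and TensorFlow versions on all of the prebuilt instances provided by Google Cloud Platform are incompatible (TF 2.5 requires CUDA >= 11.2). I solved it by reinstalling the latest version of CUDA on the prebuilt instance (TensorFlow Enterprise 2.5, CUDA 11.0), and it now works even after the instance is restarted. Google needs to update its prebuilt VM images to fix this.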
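To make the version comparison concrete, here is a minimal Python sketch that pulls the versions out of `nvidia-smi` and `nvcc -V` output and compares them with a minimum. The helper names and the `MIN_CUDA` constant are my own, not part of any NVIDIA or TensorFlow tool, and the 11.2 threshold is simply the figure stated in this answer.

```python
import re

# Assumption: the minimum comes from the answer above (TF 2.5 needs CUDA >= 11.2).
MIN_CUDA = (11, 2)

VERSION = r"(\d+(?:\.\d+)*)"


def parse_version(text):
    """Turn a dotted version string such as '11.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def nvidia_smi_versions(output):
    """Return (driver, cuda) version strings from nvidia-smi's banner.

    Note: the CUDA version printed by nvidia-smi is the highest version the
    driver supports, not necessarily the toolkit that is installed.
    """
    match = re.search(
        r"Driver Version:\s*" + VERSION + r"\s+CUDA Version:\s*" + VERSION, output
    )
    if match is None:
        raise ValueError("no driver/CUDA version found in nvidia-smi output")
    return match.group(1), match.group(2)


def nvcc_release(output):
    """Return the toolkit release (e.g. '11.0') from `nvcc -V` output."""
    match = re.search(r"release\s+" + VERSION, output)
    if match is None:
        raise ValueError("no release found in nvcc output")
    return match.group(1)


def cuda_is_new_enough(version, minimum=MIN_CUDA):
    """True if the dotted version string is at least `minimum`."""
    return parse_version(version) >= minimum
```

Fed the outputs shown in this thread, the original image reports toolkit release 11.0, which falls below the threshold, while the reinstalled toolkit reports 11.4, which passes.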
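This discussion helped me find the solution. To reinstall CUDA I did not uninstall anything; I just followed these 6 instructions (for Debian 10). I am actually on Ubuntu 18.04, but they still worked. The installer also asks whether you want to uninstall the previous CUDA version (yes!).
Now I have the following: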
a_k@a100-tfe25-vm:~$ nvidia-smi
Tue Jul 6 09:56:04 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01 Driver Version: 470.42.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:00:04.0 Off | 0 |
| N/A 38C P0 52W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
a_k@a100-tfe25-vm:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0
a_k@a100-tfe25-vm:~$ python3
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2021-07-06 09:57:08.277452: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
>>> tf.__version__
'2.5.0'
>>> tf.config.list_physical_devices()
2021-07-06 09:57:30.897584: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-06 09:57:31.689883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:00:04.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2021-07-06 09:57:31.689997: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-06 09:57:31.696712: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-07-06 09:57:31.696809: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-07-06 09:57:31.699051: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-07-06 09:57:31.699981: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-07-06 09:57:31.734585: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2021-07-06 09:57:31.735833: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-07-06 09:57:31.738230: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-07-06 09:57:31.743485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
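If you want to turn the check above into something a startup script can fail on, a small helper along these lines might do. This is only a sketch: `gpu_device_names`, `require_gpu`, and the `Device` stand-in are hypothetical names of mine, and the helper relies only on the `name` and `device_type` fields that `tf.config.PhysicalDevice` exposes.

```python
from collections import namedtuple

# Hypothetical stand-in with the same two fields as tf.config.PhysicalDevice,
# so the helpers can be tried without TensorFlow installed.
Device = namedtuple("Device", ["name", "device_type"])


def gpu_device_names(devices):
    """Return the names of the entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def require_gpu(devices):
    """Return the GPU names, or raise if the list contains no GPU."""
    names = gpu_device_names(devices)
    if not names:
        raise RuntimeError("no GPU visible; check the driver and CUDA install")
    return names
```

With TensorFlow installed, you would call it as `require_gpu(tf.config.list_physical_devices())`. On the broken image it would raise, since only the CPU device is listed there.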
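According to the fix posted in the Google Cloud Platform public forum, the issue can be mitigated in the following ways:
- Fix #1: Use the latest DLVM image (M74 or later) in a new VM instance. A fix has been released in the DLVM images as of M74, so you will no longer be affected by this issue.
- Fix #2: Patch your existing instances running images older than M74.
Run the following in an SSH session on the affected instance: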
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
This only needs to be done once; it does not need to be rerun each time the instance is restarted.