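I am trying to get GPU training working in TensorFlow 2.4.1. I am on Ubuntu 20.04 with Nvidia driver 460.32.03 installed. I have installed CUDA Toolkit 11.2 and cuDNN 8. This is what I see when starting TensorFlow: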
2021-01-21 16:23:31.457304: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-21 16:23:33.535844: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-21 16:23:33.536650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-21 16:23:33.566101: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:21:00.0 name: Quadro RTX 4000 computeCapability: 7.5
coreClock: 1.545GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 387.49GiB/s
2021-01-21 16:23:33.566157: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-21 16:23:33.571082: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-21 16:23:33.571162: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-21 16:23:33.588669: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-21 16:23:33.590407: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-21 16:23:33.592191: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-21 16:23:33.592668: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-21 16:23:33.592781: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-01-21 16:23:33.592790: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
nvidia-smi looks fine:
Thu Jan 21 16:31:51 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro RTX 4000 Off | 00000000:21:00.0 On | N/A |
| 30% 36C P8 11W / 125W | 570MiB / 7979MiB | 3% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3089 G /usr/lib/xorg/Xorg 71MiB |
| 0 N/A N/A 4021 G /usr/lib/xorg/Xorg 216MiB |
| 0 N/A N/A 4153 G /usr/bin/gnome-shell 106MiB |
| 0 N/A N/A 4641 G ...gAAAAAAAAA --shared-files 29MiB |
| 0 N/A N/A 4827 G /usr/lib/rstudio/bin/rstudio 132MiB |
+-----------------------------------------------------------------------------+
I have verified that libcudnn.so.8 exists in the same folder as the other CUDA libraries:
/usr/local/cuda-11.2/lib64$ ls -la libcud*
-rw-r--r-- 1 root root 845076 Jan 21 15:47 libcudadevrt.a
lrwxrwxrwx 1 root root 17 Jan 21 15:47 libcudart.so -> libcudart.so.11.0
lrwxrwxrwx 1 root root 20 Jan 21 15:47 libcudart.so.11.0 -> libcudart.so.11.2.72
-rwxr-xr-x 1 root root 582008 Jan 21 15:47 libcudart.so.11.2.72
-rw-r--r-- 1 root root 906670 Jan 21 15:47 libcudart_static.a
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_adv_infer.so -> libcudnn_adv_infer.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.0.5
-rwxr-xr-x 1 root root 144525080 Jan 21 16:19 libcudnn_adv_infer.so.8.0.5
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_adv_train.so -> libcudnn_adv_train.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.0.5
-rwxr-xr-x 1 root root 94896760 Jan 21 16:19 libcudnn_adv_train.so.8.0.5
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_cnn_infer.so -> libcudnn_cnn_infer.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.0.5
-rwxr-xr-x 1 root root 1438587968 Jan 21 16:19 libcudnn_cnn_infer.so.8.0.5
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_cnn_train.so -> libcudnn_cnn_train.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.0.5
-rwxr-xr-x 1 root root 89274264 Jan 21 16:19 libcudnn_cnn_train.so.8.0.5
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_ops_infer.so -> libcudnn_ops_infer.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.0.5
-rwxr-xr-x 1 root root 333101688 Jan 21 16:19 libcudnn_ops_infer.so.8.0.5
lrwxrwxrwx 1 root root 23 Jan 21 16:19 libcudnn_ops_train.so -> libcudnn_ops_train.so.8
lrwxrwxrwx 1 root root 27 Jan 21 16:19 libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.0.5
-rwxr-xr-x 1 root root 37388984 Jan 21 16:19 libcudnn_ops_train.so.8.0.5
lrwxrwxrwx 1 root root 13 Jan 21 16:19 libcudnn.so -> libcudnn.so.8
lrwxrwxrwx 1 root root 17 Jan 21 16:19 libcudnn.so.8 -> libcudnn.so.8.0.5
-rwxr-xr-x 1 root root 158264 Jan 21 16:19 libcudnn.so.8.0.5
-rw-r--r-- 1 root root 2428480120 Jan 21 16:19 libcudnn_static.a
The library also appears to load fine, with no missing dependencies:
$ ldd libcudnn.so.8
linux-vdso.so.1 (0x00007ffe41739000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f652d78a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f652d767000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f652d761000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f652d580000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f652d565000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f652d371000)
/lib64/ld-linux-x86-64.so.2 (0x00007f652d9db000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f652d222000)
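What else could I be missing?
I had the same problem and, after some trial and error, found a fix: add the CUDA directories to the PATH and LD_LIBRARY_PATH environment variables by running the following commands. The paths use cuda-11.4 from my install; substitute the version you actually have (cuda-11.2 in the question).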
$ export PATH=/usr/local/cuda-11.4/bin${PATH:+:${PATH}}
$ export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
This is section 9.1.1 of the CUDA installation guide (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#environment-setup).
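To confirm the exports took effect before starting TensorFlow again, something like the following rough sketch may help. It relies only on the Python standard library. The helper names `dirs_containing` and `can_dlopen` are made up for illustration and are not part of TensorFlow or CUDA. The idea is that `ldd` on a file path succeeds wherever the file lives, whereas loading a library by its bare name (which is what the log's "Could not load dynamic library" message is about) depends on the loader's search path, including LD_LIBRARY_PATH.

```python
import ctypes
import os


def dirs_containing(lib_name, search_path):
    """Return the entries of a search path string that contain lib_name.

    search_path is split on os.pathsep (":" on Linux), the same
    separator used by LD_LIBRARY_PATH.
    """
    hits = []
    for entry in search_path.split(os.pathsep):
        # Skip empty entries such as the one produced by a trailing ":".
        if entry and os.path.exists(os.path.join(entry, lib_name)):
            hits.append(entry)
    return hits


def can_dlopen(lib_name):
    """Try to load a shared library by bare name via the dynamic loader."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


if __name__ == "__main__":
    lib = "libcudnn.so.8"  # the library named in the warning above
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print("LD_LIBRARY_PATH entries containing", lib, ":", dirs_containing(lib, path))
    print("loadable by bare name:", can_dlopen(lib))
```

Run it from the same shell (or the same IDE session) that you launch TensorFlow from, since LD_LIBRARY_PATH is read when the process starts. If the first line prints an empty list and the second prints False, the export has not reached that process.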