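I installed tensorflow through Anaconda. It worked fine and recognized my GPU for quite a while. But suddenly, as of a few days ago, no environment with tensorflow recognizes my GPU anymore. Does anyone know what to check?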
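What I have tried:
- Creating a new environment with python=3.7 and installing tensorflow-gpu=2.1
- Reinstalling Anaconda
- Creating a new environment with python=3.6 and installing tensorflow-gpu=1.9
- Installing tensorflow-gpu=2.3 and then the missing cudatoolkit=10.1 and cudnn=7.6
- Installing tensorflow-gpu with a specific build number, as suggested in an open GitHub issue
- Setting the environment variable CUDA_VISIBLE_DEVICES to 0 via Python (TensorFlow: failed call to cuInit: CUDA_ERROR_NO_DEVICE)
- Updating my graphics driver
- Removing the modified registry entry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay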
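For anyone repeating the CUDA_VISIBLE_DEVICES step from the list above: CUDA reads that variable when it initializes, so setting it after TensorFlow has already touched the GPU has no effect. Below is a minimal sketch of one way to set it early. The helper name set_visible_gpus is made up for illustration and is not part of TensorFlow.

```python
import os
import sys


def set_visible_gpus(device_ids: str = "0") -> bool:
    """Set CUDA_VISIBLE_DEVICES for the current process.

    Hypothetical helper (not a TensorFlow API). Returns True if TensorFlow
    had not been imported yet, which is the safe case. Returns False if it
    was already imported, in which case the setting may come too late.
    """
    already_imported = "tensorflow" in sys.modules
    os.environ["CUDA_VISIBLE_DEVICES"] = device_ids
    return not already_imported


if __name__ == "__main__":
    if not set_visible_gpus("0"):
        print("Warning: tensorflow was imported before the variable was set")
    print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

Call it at the very top of the script, before the first import of tensorflow. An empty string or "-1" hides all GPUs, so it is also worth checking that nothing else on the machine sets the variable to such a value.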
Test script to check which devices are recognized:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
This is the output I get in every configuration:
> python check.py
2021-03-10 18:48:12.880629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-10 18:48:14.637784: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2021-03-10 18:48:19.201572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-10 18:48:19.705910: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-10 18:48:19.715756: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: NB-170
2021-03-10 18:48:19.721085: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: NB-170
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10539449374211484676
]
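The only line in that output that is not informational is the one with level E (the cuInit failure). When comparing logs from several environments, it can help to filter them down to error lines. Here is a rough sketch, assuming the log format shown above (timestamp, one-letter level, source file and line, message); the function name find_problems is my own.

```python
import re

# Matches lines such as:
# 2021-03-10 18:48:19.705910: E tensorflow/.../cuda_driver.cc:351] message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+?):(?P<line>\d+)\] (?P<message>.*)$"
)


def find_problems(log_text):
    """Return (source, line, message) for every E (error) or F (fatal) line."""
    problems = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") in ("E", "F"):
            problems.append(
                (match.group("source"), int(match.group("line")), match.group("message"))
            )
    return problems


if __name__ == "__main__":
    sample = (
        "2021-03-10 18:48:19.201572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] "
        "Successfully opened dynamic library nvcuda.dll\n"
        "2021-03-10 18:48:19.705910: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] "
        "failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected\n"
    )
    for source, line, message in find_problems(sample):
        print(f"{source}:{line}: {message}")
```

Note that nvcuda.dll loads successfully right before cuInit fails, so the driver library is found but reports no usable device.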
System information
- OS: Windows 10 Pro (Version 10.0.18363 Build 18363)
- Graphics card: NVIDIA GeForce GTX 1650
- Anaconda 1.10
- Registry: changed HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay to 15 in order to train Matterport's Mask R-CNN implementation
- Graphics driver: GEFORCE GAME READY DRIVER, version 461.72 WHQL; release date: 2021.2.25; operating system: Windows 10 64-bit; language: English
My nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.72       Driver Version: 461.72       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P8     6W /  N/A |    132MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
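A note on reading that header: the "CUDA Version" shown by nvidia-smi is the highest CUDA version the installed driver supports, not the toolkit version inside the conda environment. The sketch below parses the header and checks it against the toolkit version (10.1 in the logs above, per cudart64_101.dll). The function names are illustrative only.

```python
import re

HEADER = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Extract nvidia-smi, driver, and max supported CUDA versions, or None."""
    match = HEADER.search(text)
    return match.groupdict() if match else None


def version_tuple(version):
    """Turn '11.2' into (11, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(max_cuda, toolkit):
    """True if the driver's maximum CUDA version is at least the toolkit's."""
    return version_tuple(max_cuda) >= version_tuple(toolkit)


if __name__ == "__main__":
    header = "| NVIDIA-SMI 461.72 Driver Version: 461.72 CUDA Version: 11.2 |"
    info = parse_smi_header(header)
    print(info)
    print("Driver new enough for toolkit 10.1:", driver_supports(info["cuda"], "10.1"))
```

By this check the driver here is new enough for the 10.1 toolkit, so a driver that is too old is unlikely to be the cause.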
Update 1 (2021-03-14)
I did a fresh Anaconda installation on my other computer and created an environment (conda create --name tf-gpu tensorflow-gpu=2.1). On that machine, my GPU is recognized without any problems.
2021-03-14 14:21:33.934222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:37.608844: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2021-03-14 14:21:37.612173: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-14 14:21:37.658982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 970 computeCapability: 5.2
coreClock: 1.253GHz coreCount: 13 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 208.91GiB/s
2021-03-14 14:21:37.659525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:38.216002: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2021-03-14 14:21:38.625300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2021-03-14 14:21:38.660856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2021-03-14 14:21:38.971988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2021-03-14 14:21:39.247585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2021-03-14 14:21:39.564512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2021-03-14 14:21:39.565268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-14 14:21:41.272007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-14 14:21:41.272272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2021-03-14 14:21:41.272582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2021-03-14 14:21:41.283835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 2993 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17009642916451828901
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3139148187
locality {
bus_id: 1
links {
}
}
incarnation: 5677250807137925801
physical_device_desc: "device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2"
]
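In my case, I got the same error: failed call to cuInit: CUDA_ERROR_NO_DEVICE. However, nvidia-smi.exe was detecting the GPU. My system (Windows 10) had CUDA 9.0 installed. Then I realized that I had accidentally left a CUDA 10.0 version of the DLL nvcuda.dll in my application path. Removing this DLL from my application path solved the problem.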
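To look for a stray copy like that, something along these lines may help. It is only a sketch: it assumes "application path" means either a directory on PATH or the application's own directory (passed via the hypothetical extra_dirs parameter), and the function name find_dll_copies is made up.

```python
import os


def find_dll_copies(dll_name="nvcuda.dll", search_path=None, extra_dirs=()):
    """List every copy of dll_name found in extra_dirs and on the search path.

    search_path defaults to the PATH environment variable. extra_dirs is for
    directories such as the application's own folder, which Windows normally
    searches before PATH.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    directories = list(extra_dirs) + search_path.split(os.pathsep)

    hits = []
    seen = set()
    for directory in directories:
        directory = directory.strip('"')
        if not directory:
            continue
        key = os.path.normcase(os.path.abspath(directory))
        if key in seen:
            continue  # skip duplicate PATH entries
        seen.add(key)
        candidate = os.path.join(directory, dll_name)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for path in find_dll_copies("nvcuda.dll", extra_dirs=[os.getcwd()]):
        print(path)
```

The graphics driver normally installs nvcuda.dll into the Windows system directory, so a copy that shows up anywhere else (a project folder, an old CUDA directory) is worth a closer look.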