Anaconda (Win10): TensorFlow suddenly no longer recognizes the GPU (CUDA_ERROR_NO_DEVICE)



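I installed TensorFlow through Anaconda. It worked well and recognized the GPU for quite some time. A few days ago, however, every environment with TensorFlow suddenly stopped recognizing my GPU. Does anyone know what to check?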

What I have tried:

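  • Created a new environment with python=3.7 and installed tensorflow-gpu=2.1
  • Reinstalled Anaconda
  • Created a new environment with python=3.6 and installed tensorflow-gpu=1.9
  • Installed tensorflow-gpu=2.3 together with the missing cudatoolkit=10.1 and cudnn=7.6
  • Installed tensorflow-gpu with a specific build number, as suggested in an open GitHub issue
  • Set the environment variable CUDA_VISIBLE_DEVICES to 0 from within Python (as suggested in "TensorFlow: failed call to cuInit: CUDA_ERROR_NO_DEVICE")
  • Updated my graphics driver
  • Deleted the modified registry entry HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay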
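
For the CUDA_VISIBLE_DEVICES step in the list above, a minimal sketch of how the variable could be set and checked from Python. The helper name `describe_cuda_visible_devices` is hypothetical. The sketch assumes the usual CUDA behaviour that an empty value or `-1` hides every GPU (which would itself lead to CUDA_ERROR_NO_DEVICE), and that the variable has to be set before TensorFlow initializes CUDA, so it is set before the import.

```python
import os


def describe_cuda_visible_devices(environ=os.environ):
    """Summarize what the current CUDA_VISIBLE_DEVICES value means (hypothetical helper)."""
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible to CUDA"
    # Assumption: an empty string or -1 leaves no valid device index.
    if value.strip() in ("", "-1"):
        return "set to %r: all GPUs are hidden from CUDA" % value
    return "set to %r: only these devices are visible" % value


# Set the variable first, then import TensorFlow in the same process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
print(describe_cuda_visible_devices())
# import tensorflow as tf  # import only after the variable is set
```

If the helper reports that all GPUs are hidden, a leftover system-wide or conda-activation setting of the variable would be worth ruling out before looking at drivers.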

Test script to check which devices are recognized:

import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
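
As a complement to the script above, a short sketch that separates two possible causes: a TensorFlow build without CUDA support versus a CUDA build that finds no device. It assumes TensorFlow 2.1 or newer, where `tf.config.list_physical_devices` is available; the function name `tensorflow_gpu_report` is hypothetical, and the function returns None when TensorFlow is not importable.

```python
def tensorflow_gpu_report():
    """Return build and device info from TensorFlow, or None if it is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # False would point to a CPU-only build rather than a driver problem.
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # An empty list with built_with_cuda=True matches the symptom described here.
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


print(tensorflow_gpu_report())
```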

This is the output I get in every configuration:

> python check.py
2021-03-10 18:48:12.880629: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-10 18:48:14.637784: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2021-03-10 18:48:19.201572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-10 18:48:19.705910: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-03-10 18:48:19.715756: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: NB-170
2021-03-10 18:48:19.721085: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: NB-170
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 10539449374211484676
]

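System information
  • Operating system: Windows 10 Pro (Version 10.0.18363 Build 18363)
  • Graphics card: NVIDIA GeForce GTX 1650
  • Anaconda 1.10
  • Registry: changed HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers\TdrDelay to 15 in order to train Matterport's Mask R-CNN implementation
  • Graphics driver: GeForce Game Ready Driver, Version: 461.72 WHQL; Release date: 2021.2.25; Operating system: Windows 10 64-bit; Language: English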
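
To confirm whether the TdrDelay registry value mentioned above is still present, something like the following could be used. It is a sketch only: `read_tdr_delay` is a hypothetical helper built on the standard-library `winreg` module, and it returns None on non-Windows systems or when the value does not exist.

```python
import sys


def read_tdr_delay():
    """Return the TdrDelay value in seconds, or None if it is absent (hypothetical helper)."""
    if sys.platform != "win32":
        return None  # winreg only exists on Windows
    import winreg

    key_path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return value
    except FileNotFoundError:
        # The key or the value is missing, so Windows uses its default timeout.
        return None


print(read_tdr_delay())
```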

My nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 461.72       Driver Version: 461.72       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1650   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   54C    P8     6W /  N/A |    132MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

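Update 1 (2021-03-14)

I set up a fresh Anaconda installation on my other computer and created an environment there (conda create --name tf-gpu tensorflow-gpu=2.1). On that machine, the GPU is recognized without any problems.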

2021-03-14 14:21:33.934222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:37.608844: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2021-03-14 14:21:37.612173: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2021-03-14 14:21:37.658982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 970 computeCapability: 5.2
coreClock: 1.253GHz coreCount: 13 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 208.91GiB/s
2021-03-14 14:21:37.659525: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
2021-03-14 14:21:38.216002: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2021-03-14 14:21:38.625300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_10.dll
2021-03-14 14:21:38.660856: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_10.dll
2021-03-14 14:21:38.971988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_10.dll
2021-03-14 14:21:39.247585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_10.dll
2021-03-14 14:21:39.564512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2021-03-14 14:21:39.565268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-14 14:21:41.272007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-14 14:21:41.272272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2021-03-14 14:21:41.272582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N
2021-03-14 14:21:41.283835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/device:GPU:0 with 2993 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17009642916451828901
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3139148187
locality {
bus_id: 1
links {
}
}
incarnation: 5677250807137925801
physical_device_desc: "device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0, compute capability: 5.2"
]

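In my case, I got the same error: failed call to cuInit: CUDA_ERROR_NO_DEVICE. However, nvidia-smi.exe did detect the GPU. My system (Windows 10) had CUDA 9.0 installed. Then I realized that I had accidentally left a CUDA 10.0 version of the DLL nvcuda.dll in my application path. Removing this DLL from my application path solved the problem.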
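
A rough way to look for such a stray copy is to scan the current directory and every PATH entry for nvcuda.dll. The sketch below is an illustration under assumptions: `find_dll_copies` is a hypothetical helper, and it only approximates the Windows DLL search order (it does not inspect the application's own directory unless that is passed via `extra_dirs`).

```python
import os
from pathlib import Path


def find_dll_copies(dll_name="nvcuda.dll", extra_dirs=()):
    """Return the full path of every copy of dll_name found in the searched directories."""
    candidates = [Path.cwd()]
    candidates += [Path(d) for d in extra_dirs]
    candidates += [Path(p) for p in os.environ.get("PATH", "").split(os.pathsep) if p]

    hits, seen = [], set()
    for directory in candidates:
        try:
            resolved = directory.resolve()
            if resolved in seen:
                continue  # skip directories listed more than once
            seen.add(resolved)
            target = resolved / dll_name
            if target.is_file():
                hits.append(target)
        except (OSError, RuntimeError):
            continue  # unreadable or malformed entry
    return hits


for path in find_dll_copies():
    print(path)
```

The driver normally installs its own nvcuda.dll in the Windows system directory, so a hit anywhere else (for example next to a script or inside a project folder) would be a candidate for the kind of leftover described above.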
