Tensorflow需要很长时间才能连接到Nvidia-Drivers



我已经在我的远程服务器上安装了我的 Nvidia 驱动程序,并且

$nvidia-smi还返回设备列表

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:01:00.0 Off |                  N/A |
| 36%   64C    P2    78W / 300W |   9813MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   45C    P8     8W / 300W |    774MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       333      C   python                                      9039MiB |
|    0     11773      C   python                                       191MiB |
|    0     12111      C   python                                       191MiB |
|    0     12591      C   python                                       191MiB |
|    0     12999      C   python                                       191MiB |
|    1     11773      C   python                                       191MiB |
|    1     12111      C   python                                       191MiB |
|    1     12591      C   python                                       191MiB |
|    1     12999      C   python                                       191MiB |
+-----------------------------------------------------------------------------+

但是当我尝试使用时。

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()

给出设备列表需要很长时间

Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2019-09-05 15:56:27.568310: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-09-05 15:56:27.596133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-09-05 15:56:27.597408: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9a2fe370 executing computations on platform Host. Devices:
2019-09-05 15:56:27.597421: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-09-05 15:56:27.598606: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-09-05 15:59:27.755714: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.757588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758287: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55da9c6d9140 executing computations on platform CUDA. Devices:
2019-09-05 15:59:27.758299: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758303: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-05 15:59:27.758508: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.758883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:01:00.0
2019-09-05 15:59:27.758919: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.759313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.755
pciBusID: 0000:02:00.0
2019-09-05 15:59:27.759441: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.760153: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-09-05 15:59:27.760803: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-09-05 15:59:27.760941: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-09-05 15:59:27.761755: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-09-05 15:59:27.762395: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-09-05 15:59:27.764370: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-09-05 15:59:27.764435: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.764846: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765262: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.765638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.766026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0, 1
2019-09-05 15:59:27.766046: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-09-05 15:59:27.767185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-09-05 15:59:27.767194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
2019-09-05 15:59:27.767197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y
2019-09-05 15:59:27.767200: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N
2019-09-05 15:59:27.767370: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.767755: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768175: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.768546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:0 with 950 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2019-09-05 15:59:27.768811: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-09-05 15:59:27.769219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/device:GPU:1 with 9703 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 14451063918691325445
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 3472161188084064797
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1975372846861552523
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 8192574289833793917
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 996802560
locality {
bus_id: 1
links {
link {
device_id: 1
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 9560317274260358344
physical_device_desc: "device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 10175270093
locality {
bus_id: 1
links {
link {
type: "StreamExecutor"
strength: 1
}
}
}
incarnation: 16877706551631715197
physical_device_desc: "device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5"

当我尝试检查信息以使用 nvidia-docker

sudo nvidia-container-cli -k -d /dev/tty info 

它给出以下输出:

I0905 09:00:43.600569 9993 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: driver error: timed out

似乎我的 nvlink 有问题,

因此,如果您的训练需要太多时间才能开始。

请检查链接是否打开而不是关闭。

以下命令如下。

$ nvidia-smi nvlink --status
GPU 0: GeForce RTX 2080 Ti (UUID: GPU-797d7153-ea28-d678-dc38-859b914d6dd7)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s
GPU 1: GeForce RTX 2080 Ti (UUID: GPU-8807c553-7571-582d-c2ee-02993527b0a6)
Link 0: 25.781 GB/s
Link 1: 25.781 GB/s

谢谢

最新更新