TensorFlow and PyTorch hang during initialization when using CUDA



When I try to run a very small TensorFlow example:

import tensorflow as tf
c = tf.constant([1,2,3])

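the process hangs indefinitely (for at least ten minutes), with no indication of what it is doing. While in this state, it uses 100% of one virtual CPU core. When run in a Jupyter notebook, the kernel prints this to the console: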

2020-03-31 11:12:04.840507: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:12:04.840576: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:12:04.840589: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-03-31 11:12:05.521172: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-03-31 11:12:05.539193: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.539639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 computeCapability: 6.1
coreClock: 1.7845GHz coreCount: 15 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 238.66GiB/s
2020-03-31 11:12:05.539841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-03-31 11:12:05.541113: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-31 11:12:05.542119: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-03-31 11:12:05.542324: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-03-31 11:12:05.543632: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-03-31 11:12:05.544401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-03-31 11:12:05.547212: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-31 11:12:05.547337: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.548015: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-31 11:12:05.548512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-03-31 11:12:05.567845: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3393550000 Hz
2020-03-31 11:12:05.568364: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564107e16440 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-03-31 11:12:05.568395: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version

I did have TensorFlow working on this system before, so I think this may be some kind of library problem caused by a system update.

The GPU is an Nvidia GTX 1070. The TensorFlow version is 2.1.0 and has not changed since it last worked. I am running Arch Linux, in case that matters.

I tried downgrading from CUDA 10.2 to 10.1, but the problem persists.

I can also reproduce this with PyTorch:

import torch
import transformers
t = torch.tensor([1,2,3])
t.cuda()

(The import transformers line prevents a "CUDA: out of memory" error; it must do something that initializes PyTorch, but I don't know what.)

This has the same problem: it freezes with one CPU core fully busy, although it produces less output:

2020-03-31 11:13:41.428483: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:13:41.428571: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-03-31 11:13:41.428587: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

I'm fairly sure the TensorRT complaints are unrelated, because it printed the same warnings back when this was working.

How can I fix this? Or, at the very least, what else can I do to find out what it is doing while it is frozen?
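One general way to see where a hung Python process is stuck is the standard library's faulthandler module, which can print every thread's Python traceback after a timeout. The sketch below is only an illustration: the helper names arm_hang_watchdog and disarm_hang_watchdog and the 30-second timeout are made up for this example. It only shows the Python-level stack (for instance, which import or call never returned), not what the native CUDA code is doing.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=30.0, file=sys.stderr):
    """Dump all threads' Python tracebacks to `file` every `timeout_s` seconds.

    Hypothetical helper name; it simply wraps faulthandler.dump_traceback_later.
    The dump repeats until disarm_hang_watchdog() is called.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disarm_hang_watchdog():
    """Cancel the pending traceback dump once the slow section has finished."""
    faulthandler.cancel_dump_traceback_later()


# Intended usage around the code that hangs (left commented out here
# so the sketch runs without TensorFlow installed):
#
#   arm_hang_watchdog(30.0)
#   import tensorflow as tf
#   c = tf.constant([1, 2, 3])
#   disarm_hang_watchdog()
```

If the traceback keeps pointing at the same line, the hang is inside that call, and a native debugger or strace would be the next step.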

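My problem was caused by a ulimit I had set on the amount of virtual memory that Python processes are allowed to consume (using ulimit -Sv 12000000 in zsh). I don't know why this makes it hang, but if anyone else runs into a similar problem, make sure you are not limiting virtual memory.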
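To check from inside Python whether such a limit is in effect, the standard library's resource module can report the address-space limit of the current process. This is a minimal sketch; the helper names describe_limit and address_space_limits are made up for this example. It assumes that ulimit -v counts in kibibytes (as it typically does in zsh and bash) while RLIMIT_AS is reported in bytes.

```python
import resource


def describe_limit(value):
    """Format an rlimit value given in bytes as a readable string."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.2f} GiB"


def address_space_limits():
    """Return the (soft, hard) virtual-memory limits of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_AS)


if __name__ == "__main__":
    soft, hard = address_space_limits()
    # Anything other than "unlimited" for the soft limit means a cap
    # like the one set by `ulimit -Sv` is active in this process.
    print("soft limit:", describe_limit(soft))
    print("hard limit:", describe_limit(hard))
```

If the soft limit is not reported as unlimited, running ulimit -Sv unlimited in the shell before starting Python (or removing the ulimit line from the shell startup file) lifts it, provided the hard limit allows it.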
