Tensorflow not loading with the GPU-enabled docker version (>19.03)



I want to use docker 19.03 and later to get GPU support. I currently have docker 19.03.12 on my system. I can run the following command to check that the Nvidia driver is working:

docker run -it --rm --gpus all ubuntu nvidia-smi
Wed Jul  1 14:25:55 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
| 26%   54C    P5    13W / 180W |    734MiB /  8119MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
         
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Also, my module works fine with GPU support when run locally. However, if I build a docker image and try to run it, I get this message:

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
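
libcuda.so.1 belongs to the NVIDIA driver and is mounted into the container by the NVIDIA container runtime at run time, so it is not something the image itself installs. As a quick diagnostic (just a sketch, not part of my module), you can try to load it directly from Python inside the container:

import ctypes

try:
    # dlopen the driver library directly; this is essentially the same lookup
    # that fails when TensorFlow raises the ImportError above
    ctypes.CDLL("libcuda.so.1")
    print("libcuda.so.1 loaded fine")
except OSError as err:
    print("could not load libcuda.so.1:", err)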

I am using cuda 9.0 with tensorflow 1.12.0, but I can switch to cuda 10.0 with tensorflow 1.15.
As I understand it, the problem is that I am probably using a previous version of the dockerfile, with commands that do not make it compatible with the new GPU-enabled docker versions (19.03 and later).
The actual commands are the following:

FROM nvidia/cuda:9.0-base-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libnccl2=2.2.13-1+cuda9.0 \
        libfreetype6-dev \
        libhdf5-serial-dev \
        libpng12-dev \
        libzmq3-dev \
        pkg-config \
        software-properties-common \
        unzip \
        && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
        apt-get install nvinfer-runtime-trt-repo-ubuntu1604-4.0.1-ga-cuda9.0 && \
        apt-get update && \
        apt-get install libnvinfer4=4.1.2-1+cuda9.0

I also could not find a base Dockerfile for basic GPU usage.

In this answer there is a suggestion about exposing libcuda.so.1, but it did not work in my case.

So, is there any workaround for this problem, or a base dockerfile to adjust?

My system is Ubuntu 16.04.

Edit:

I just noticed that nvidia-smi inside docker does not show any cuda version:

CUDA Version: N/A

in contrast to the one running locally. So, this probably means that, for some reason, cuda is not being loaded inside docker, I guess.

TLDR;

A base Dockerfile that seems to work with docker 19.03+ and cuda 10 is this:

FROM nvidia/cuda:10.0-base

It can be combined with tf 1.14, but for some reason I could not get it to work with tf 1.15.

I just tested it with this Dockerfile:

FROM nvidia/cuda:10.0-base
CMD nvidia-smi

Longer answer:

Well, after a lot of trial and error (and frustration), I managed to make it work for docker 19.03.12 + cuda 10 (albeit with tf 1.14 instead of 1.15).

I used the code from this post, together with the base Dockerfiles provided there.

First, I tried checking nvidia-smi from inside docker using this Dockerfile:

FROM nvidia/cuda:10.0-base
CMD nvidia-smi

$ docker build -t gpu_test .
...
$ docker run -it --gpus all gpu_test
Fri Jul  3 07:31:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  Off  | 00000000:01:00.0 Off |                  N/A |
| 45%   65C    P2   142W / 180W |   8051MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
         
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

It seems that the CUDA binaries are finally found: CUDA Version: 10.1.

Then, I made a minimal Dockerfile to test whether the tensorflow binary libraries load successfully inside docker:

FROM nvidia/cuda:10.0-base
# The following just declare variables and ultimately select python3 and pip3
ARG USE_PYTHON_3_NOT_2=True
ARG _PY_SUFFIX=${USE_PYTHON_3_NOT_2:+3}
ARG PYTHON=python${_PY_SUFFIX}
ARG PIP=pip${_PY_SUFFIX}
RUN apt-get update && apt-get install -y \
    ${PYTHON} \
    ${PYTHON}-pip
RUN ${PIP} install tensorflow_gpu==1.14.0
COPY bashrc /etc/bash.bashrc
RUN chmod a+rwx /etc/bash.bashrc
WORKDIR /src
COPY *.py /src/
ENTRYPOINT ["python3", "tf_minimal.py"]

tf_minimal.py is simply:

import tensorflow as tf
print(tf.__version__)
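
If you also want to confirm that TensorFlow actually sees the GPU (and not just that it imports), a slightly extended version of the script could look like this sketch, using the TF 1.x API:

import tensorflow as tf

print(tf.__version__)
# In TF 1.x this returns True only if a CUDA device is visible and the
# CUDA libraries (including libcuda.so.1) were loaded successfully
print("GPU available:", tf.test.is_gpu_available())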

For completeness, I am just posting the bashrc file I am using:

# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# ==============================================================================
export PS1="\[\e[31m\]tf-docker\[\e[m\] \[\e[33m\]\w\[\e[m\] > "
export TERM=xterm-256color
alias grep="grep --color=auto"
alias ls="ls --color=auto"
echo -e "\e[1;31m"
cat<<TF
________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/
TF
echo -e "\e[0;33m"
if [[ $EUID -eq 0 ]]; then
cat <<WARN
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u \$(id -u):\$(id -g) args...
WARN
else
cat <<EXPL
You are running this container as user with ID $(id -u) and group $(id -g),
which should map to the ID and group for your user on the Docker host. Great!
EXPL
fi
# Turn off colors
echo -e "\e[m"

Latest update: