在docker容器中构建第三方库(libtorch
,如果重要的话(的过程中,我遇到了一个丢失包含文件的错误。当从Ubuntu 16.04主机运行构建过程时,同样的构建过程也很好,但当从Ubuntu 18.04主机运行时,文件丢失了。
经过一点追溯,我现在只是从NVidia运行基本容器,并查找文件。这是我得到的输出:
Ubuntu 16.04 host
:
$ uname -a
Linux ub-carmel 4.15.0-123-generic #126~16.04.1-Ubuntu SMP Wed Oct 21 13:48:05 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 19.03.13, build 4484c46d9d
$ docker pull nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
11.1-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
Digest: sha256:c5bf5c984998cc18a3f3a741c2bd7187ed860dc6d993b6fb402d0effb9fe6579
Status: Image is up to date for nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
$ docker run -it nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
root@2ecc17248fab:/# ll /usr/lib/gcc/x86_64-linux-gnu/7/include | grep ia32
-rw-r--r-- 1 root root 7817 Dec 4 2019 ia32intrin.h
Ubuntu 18.04 host
:
$ uname -a
Linux ub-carmel-18-04 5.4.0-56-generic #62~18.04.1-Ubuntu SMP Tue Nov 24 10:07:50 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ docker --version
Docker version 19.03.14, build 5eb3275d40
$ docker pull nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
11.1-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
Digest: sha256:c5bf5c984998cc18a3f3a741c2bd7187ed860dc6d993b6fb402d0effb9fe6579
Status: Downloaded newer image for nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
$ docker run -it nvcr.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
root@89f771e82a51:/# ll /usr/lib/gcc/x86_64-linux-gnu/7/include | grep ia32
root@89f771e82a51:/#
正如你所看到的,图像的sha256摘要是相同的(并且与NVidia的NGC的摘要匹配(
起初,我认为可能以某种隐藏的方式包含来自主机,但ia32intrin.h
文件存在于两个主机中
是什么导致了这样的问题?
编辑
添加了每个主机的docker --version
输出。这是有区别的,但我怀疑这是否会导致这样的问题
编辑2
添加了uname -a
的输出
编辑3
docker version
:输出
Ubuntu 16
:
$ docker version
Client: Docker Engine - Community
Version: 19.03.13
API version: 1.40
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:02:59 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.13
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 4484c46d9d
Built: Wed Sep 16 17:01:30 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.7
GitCommit: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
Ubuntu 18
:
$ docker version
Client: Docker Engine - Community
Version: 19.03.14
API version: 1.40
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:20:17 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.14
API version: 1.40 (minimum version 1.12)
Go version: go1.13.15
Git commit: 5eb3275d40
Built: Tue Dec 1 19:18:45 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.3.9
GitCommit: ea765aba0d05254012b0b9e595e995c09186427f
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
所以我在不同的Ubuntu机器(EC2实例(上测试了它,在这种情况下,18.04&16.04文件存在。看来我的机器出了问题。有什么想法可以引起这种情况吗?
最好的猜测是Ubuntu 18.04主机上的拉层在某种程度上已经损坏。清理这一问题的核选项是重置docker。这将删除所有映像、卷、容器、日志、网络等,因此在运行此之前备份您想要保留的任何内容:
sudo -s # these commands need root
systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker
exit # exit sudo