Docker container DNS lookups break after about 36 hours of normal operation



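I have a single container deployed on a host via docker-compose (DNS goes through the Docker daemon's DNS server at 127.0.0.11). The DNS server configured in the host's /etc/resolv.conf is on a private network and has no internet access.

The container runs fine for a while (about 40 hours), then DNS lookups start failing with timeout messages. The application log shows failures against the Docker DNS server: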

Caused by: java.net.UnknownHostException: failed to resolve 'alfresco.test.duf'
    at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1013)
    at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:966)
    at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:414)
    at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:63)
    at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:463)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:225)
    at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:177)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/127.0.0.11:53] query via UDP timed out after 5000 milliseconds (no stack trace available)

The Docker daemon log shows failures against the local network DNS server:

Aug 25 12:19:15 st2510v dockerd[6749]: time="2021-08-25T12:19:15.066556867+02:00" level=warning msg="[resolver] connect failed: dial udp 157.164.138.33:53: connect: resource temporarily unavailable"

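Pinging the target server from the Docker host resolves correctly.

Starting a bash container in the same Docker network (the one created by compose) and pinging the target server from there also resolves correctly.

Pinging any server (external DNS, Docker DNS, the bash container) from within the problematic container fails to resolve.

The container does not recover from the error on its own.

Restarting or recreating the container resolves the problem.

I compared the host iptables rules and network interfaces with those of a working instance that does not have this problem at all, but this did not reveal any significant differences.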

Any suggestions as to what the problem is, or how to diagnose what it might be?
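Since the failure only shows up after a day or more, one way to pin down when it starts is a small probe that resolves the name periodically and logs the result. The following is a minimal sketch, not something taken from the setup above: the function names `probe` and `watch` and the 60-second interval are hypothetical, and it assumes a Python interpreter is available inside the affected container so that lookups go through the same /etc/resolv.conf (and therefore 127.0.0.11).

```python
import socket
import time


def probe(hostname, resolver=socket.getaddrinfo):
    """Resolve hostname once; return (ok, elapsed_seconds, detail)."""
    start = time.monotonic()
    try:
        results = resolver(hostname, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, time.monotonic() - start, repr(exc)
    # Each getaddrinfo entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address for both IPv4 and IPv6.
    addresses = sorted({entry[4][0] for entry in results})
    return True, time.monotonic() - start, ",".join(addresses)


def watch(hostname, interval_seconds=60, rounds=None):
    """Print one timestamped line per probe; rounds=None loops forever."""
    done = 0
    while rounds is None or done < rounds:
        ok, elapsed, detail = probe(hostname)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{stamp} host={hostname} ok={ok} elapsed={elapsed:.3f}s {detail}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval_seconds)
```

Calling `watch("alfresco.test.duf")` in the container and keeping the output would give a timestamp for the first failed lookup, which can then be lined up against the dockerd log entries.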

Update 1

Docker version output:

[al6735@st2510v ~]$ sudo docker version
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.40
 Go version:        go1.12.12
 Git commit:        633a0ea
 Built:             Wed Nov 13 07:25:41 2019
 OS/Arch:           linux/amd64
 Experimental:      false
Server: Docker Engine - Community
 Engine:
  Version:          19.03.5
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.12
  Git commit:       633a0ea
  Built:            Wed Nov 13 07:24:18 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

Docker info output:

[al6735@st2510v ~]$ sudo docker info
Client:
 Debug Mode: false
Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 3
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-957.21.2.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.51GiB
 Name: st2510v
 ID: KTEE:M3ZD:5ZS5:DVFU:R6VJ:YV7Q:QPP5:D4YG:ITV7:YC3U:YP3J:AEDG
 Docker Root Dir: /home/docker
 Debug Mode: true
  File Descriptors: 38
  Goroutines: 48
  System Time: 2021-09-24T14:23:42.314595155+02:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Further inspection of the host showed that the Java application in the target container was holding a large number of TCP sockets.
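For anyone wanting to check the same thing, socket counts per TCP state can be read from /proc/net/tcp, which reflects the network namespace of the process reading it. Below is a minimal sketch under that assumption; the function names are hypothetical, and it would need to run inside the container (or in its network namespace) to see the container's sockets rather than the host's.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        state = fields[3].upper()
        counts[TCP_STATES.get(state, f"UNKNOWN({state})")] += 1
    return counts


def read_tcp_states(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum the per-state counts over the IPv4 and IPv6 tables."""
    total = Counter()
    for path in paths:
        try:
            with open(path) as handle:
                total.update(count_tcp_states(handle.read()))
        except FileNotFoundError:
            pass  # e.g. the IPv6 table is absent
    return total
```

Printing `read_tcp_states()` at intervals shows whether one state keeps growing over time; a steadily rising count would point at connections that are opened but never released.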

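After the application was fixed so that it no longer held so many sockets, the connection problem did not occur again. The assumption is that we were hitting a limit on the number of open sockets a container can have.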
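Which limit was actually hit was not confirmed. One candidate that is easy to check is the per-process open-files limit, since every socket consumes a file descriptor. The sketch below is only an illustration under that assumption: it counts the socket descriptors of a process via /proc/<pid>/fd and reads the soft "Max open files" value from /proc/<pid>/limits. The function names are hypothetical, and the PID would be that of the Java process as seen from wherever the script runs.

```python
import os


def count_socket_fds(pid):
    """Return (socket_fds, total_fds) for a process, read from /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    sockets = 0
    total = 0
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were iterating
        total += 1
        if target.startswith("socket:"):
            sockets += 1
    return sockets, total


def soft_open_files_limit(pid):
    """Parse the soft 'Max open files' limit from /proc/<pid>/limits."""
    with open(f"/proc/{pid}/limits") as handle:
        for line in handle:
            if line.startswith("Max open files"):
                # Columns: "Max open files", soft limit, hard limit, units.
                value = line.split()[3]
                return None if value == "unlimited" else int(value)
    return None
```

If the total descriptor count sits close to the soft limit while lookups are failing, that would support this explanation; if it is far below, some other limit is more likely to be involved.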
