Concourse worker on another server loses its connection to Concourse web



We have a Concourse web container and a Concourse worker container running on server A (212.77.7.255; the real IP has been replaced with a made-up one). We use the latest Concourse version, 7.8.1.

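When we ran out of worker resources, we added another Concourse worker container running on server B. The worker on server B ran fine for about five days, but then it suddenly could no longer connect to Concourse web on server A.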

The logs of the worker on server B show:

{
"timestamp": "2022-07-12T11:15:59.542985762Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.543044656Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4.2"
}
}{
"timestamp": "2022-07-12T11:15:59.543060804Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.543068953Z",
"level": "error",
"source": "worker",
"message": "worker.container-sweeper.tick.failed-to-get-containers-to-destroy",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "6.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554118751Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554164844Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4.3"
}
}{
"timestamp": "2022-07-12T11:15:59.554172593Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:15:59.554179789Z",
"level": "error",
"source": "worker",
"message": "worker.volume-sweeper.tick.failed-to-get-volumes-to-destroy",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "7.4"
}
}{
"timestamp": "2022-07-12T11:16:04.580220012Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.failed-to-connect-to-tsa",
"data": {
"error": "dial tcp 212.77.7.255:2222: i/o timeout",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580284659Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1.10"
}
}{
"timestamp": "2022-07-12T11:16:04.580335377Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.failed-to-dial",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580359868Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.beacon.exited-with-error",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580372552Z",
"level": "debug",
"source": "worker",
"message": "worker.beacon-runner.beacon.done",
"data": {
"session": "4.1"
}
}{
"timestamp": "2022-07-12T11:16:04.580394879Z",
"level": "error",
"source": "worker",
"message": "worker.beacon-runner.failed",
"data": {
"error": "all worker SSH gateways unreachable",
"session": "4"
}
}
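
Since the log above is a run of JSON objects written back to back, it can help to collapse it into a count per distinct error before reading it line by line. The following is only a sketch: the function names are made up for illustration, and it assumes every object in the input is valid JSON.

```python
import json
from collections import Counter


def iter_json_objects(text):
    """Yield each JSON object from a string of back-to-back objects ("}{")."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def summarize_errors(text):
    """Count error-level entries by the string in their "data.error" field."""
    counts = Counter()
    for entry in iter_json_objects(text):
        if entry.get("level") == "error":
            error = entry.get("data", {}).get("error", "<no error field>")
            counts[error] += 1
    return counts
```

Run over the log above, this should leave only two distinct errors: the "dial tcp 212.77.7.255:2222: i/o timeout" entries and the "all worker SSH gateways unreachable" entries that follow from them.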

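The logs of Concourse web on server A show no entries of the worker on server B trying to connect. From server B itself, I am able to connect to Concourse web on server A: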

$ nc 212.77.7.255 2222
SSH-2.0-Go
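
The nc check above can also be scripted, so that exactly the same probe can be run once on the host and once inside the worker container to compare the results. This is a minimal sketch: the function name and the default timeout are illustrative assumptions, and it presumes a Python interpreter is available wherever it is run, which may not be the case inside the worker image.

```python
import socket


def probe_tsa(host, port=2222, timeout=5.0):
    """Open a TCP connection and return the first bytes the server sends.

    Raises OSError (socket.timeout is a subclass) if the connection
    cannot be established or no data arrives within the timeout.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        banner = sock.recv(256)
    return banner.decode("ascii", errors="replace").strip()


if __name__ == "__main__":
    # 212.77.7.255 is the placeholder address used in this question.
    try:
        print(probe_tsa("212.77.7.255"))
    except OSError as exc:
        print(f"connection failed: {exc}")
```

If the probe prints an SSH banner on the host but times out inside the container, the problem lies in the container's network path and not in Concourse itself.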

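We had this problem before, but we solved it by upgrading Concourse to the latest version, 7.8.1. Now I have run out of ideas for where to look next. What I have tried: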

  • Restarting the worker
  • Restarting the web container
  • Pruning the worker of server B
  • Running docker system prune on server B

Nothing helped. What can I do to debug this further and get the worker on server B to connect again?

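You said that this also happened with an earlier version, that you "ran out of worker resources", and I see I/O timeouts in the logs... One component you did not mention is the DB.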

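It could be that the maximum number of connections on the database has been reached, especially if the database is not used solely for Concourse. That is where I would look.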

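We never found out why the Docker network would not allow the connection to server A. Since the connection worked fine from the host, we told Docker to use the host network: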

services:
  concourse-worker:
    ...
    network_mode: host
    ...

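That solved the problem. It is not a great solution, since a Docker container should have its own isolated network, but because nothing else runs on this server, it is fine.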
