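I have implemented an HTTP health check and a separate HTTP liveness check for my pod. In both cases, if the pod delays before responding, Kubernetes behaves as expected. However, when the checks respond immediately with status 500, Kubernetes treats that as a successful response. This happens after the pod has started and is running normally, once the checks begin returning status 500.
In fact, returning status 500 appears to reset the failure count, so my pod is considered healthy again.
The question is: am I doing something wrong? How do I get Kubernetes to do its thing when my pod is unhealthy?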
$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
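To investigate, I added test endpoints to my pod so that I can change its behaviour at runtime: pass (200), fail (500), and delayed fail (wait 15 seconds, then return 500). I kept the health endpoint and the liveness endpoint separate.
From kubectl describe pod: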
Liveness: exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness: exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3
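I tested the endpoints by exec'ing into the pod and curling them from there (details below).
I then cycled the liveness check and the health check through the three modes and monitored how Kubernetes responded.
Health check: expect the pod to restart after 5 consecutive failed health checks.
Liveness check: describe the service and expect the pod's IP address to be removed from the endpoints list.
Success case: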
bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
OK
* Connection #0 to host localhost left intact
Failure case:
bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false
bash-4.4$ curl http://localhost:30030/healthz -v
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact
Delayed failure case:
bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true
bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb 5 13:33:08 UTC 2021
* Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact
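Test results
Both the health and liveness endpoints default to SUCCESS, returning status 200 -> pod starts up and works fine.
Set the liveness check to FAIL, returning status 500 -> no change; the pod IP is still in the service and requests are still sent to the pod.
Set the liveness check to DELAY before responding (then 500) -> pod is removed from the Kubernetes service (yippee).
Set the liveness check to FAIL (fast) again -> pod is restored to the service (treated as success).
Set the health check to FAIL (returning status 500) -> no effect; the pod keeps running without a restart.
Set the health check to DELAY before responding (then 500) -> pod restarts after 5 failed probes.
Thanks for any help. I suppose I could change my code to delay before responding in the failure case, but that feels like a workaround.
Thanks to @mdaniel's comment, the problem is solved. I am expanding on it here because it took me a while to fully understand the comment.
The problem was in the health and liveness check configuration in the pod spec.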
readinessProbe:
  exec:
    command:
    - curl
    - http://localhost:30030/healthz
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
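This relies on the exit status of the curl command in the exec clause.
By default curl exits with code 0 whenever it receives an HTTP response, even a 500, so the probe counted every fast failure as a success. The delayed responses were presumably caught only because the command overran the 1-second timeoutSeconds. If you want to use curl, use curl -f. With that flag it exits non-zero when the server returns an HTTP error status.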
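As a minimal sketch of the curl -f variant, here is the same readiness probe with only the flag added. The endpoint and timing values are copied from the original spec; nothing else is assumed to change.

```yaml
readinessProbe:
  exec:
    command:
    - curl
    - "-f"   # --fail: curl exits non-zero (22) on HTTP status 400 or above
    - http://localhost:30030/healthz
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

With this in place a fast 500 from /healthz makes the command exit non-zero, which the kubelet counts as a failed probe.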
But it is better to use httpGet in the pod spec, like this:
readinessProbe:
  httpGet:
    path: /healthz
    port: 30030
    scheme: HTTP
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
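I tested both and both work. I will go with httpGet as suggested; it is the right tool for the job.
Note that the reason for using exec/curl instead of httpGet in the first place was that the pod uses TLS, which blocks plain HTTP probes coming from Kubernetes. Reference: https://medium.com/cloud -原生gathering/kubernetes -活性探针- -抓形象启用了- istio mtl - - 90543 e4bae34
Thanks!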
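The liveness probe can presumably be converted the same way. The sketch below assumes the /livez path, port, and timing values shown in the describe output earlier (delay=10s, timeout=1s, period=10s, #success=1, #failure=5) still apply; I have only shown the readiness version as tested above.

```yaml
livenessProbe:
  httpGet:
    path: /livez          # liveness endpoint from the describe output
    port: 30030
    scheme: HTTP
  failureThreshold: 5     # restart the container after 5 consecutive failures
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

An httpGet probe treats any status from 200 up to (but not including) 400 as success and anything else as failure, so a fast 500 is counted correctly without relying on an exit code.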