OpenShift: pod not restarting on OutOfMemory error



I have a pod running in OpenShift. The pod runs a Kafka consumer that continuously polls a topic and stores the records locally for a given amount of time. Occasionally the topic receives a large burst of new records. Because of the memory needed to store them, this leads to an OOM error. That in itself is fine, since the pod can simply restart and start consuming again.

The problem, however, is that the pod does not restart on the OOM error. After the crash, the health endpoint (server) is still alive, so the pod is not restarted: OpenShift still considers it healthy. Judging from the log messages, the shutdownHook never runs, presumably because the uncaught error only kills the consumer thread and the JVM itself keeps running.

My health endpoint service is implemented as:

class HealthService : ILogging by Logging<HealthService>() {

    @Get("/health")
    fun health(): HttpResponse {
        log.trace("I'm $responseText")
        return HttpResponse.of(statusCode, MediaType.PLAIN_TEXT_UTF_8, responseText)
    }

    /**
     * Should be called when the graceful shutdown process is completed. The service will now be
     * considered dead by Kubernetes and the pod will be restarted.
     */
    fun die() {
        log.trace("Last breath...")
        health.set(DEAD)
    }

    /** Thread-safe health state. */
    private val health: AtomicInteger = AtomicInteger(ALIVE)

    private val responseText
        get() =
            when (health.get()) {
                ALIVE -> "alive"
                SICK -> "sick"
                else -> "dead"
            }

    private val statusCode
        get() =
            when (health.get()) {
                DEAD -> HttpStatus.SERVICE_UNAVAILABLE
                else -> HttpStatus.OK
            }

    companion object {
        const val ALIVE = 0
        const val SICK = 1
        const val DEAD = 2
    }
}

My main application is implemented as:

val log = Logger()
lateinit var healthService: HealthService

fun run() {
    val consumer = createKafkaConsumer()
    val server = buildServer(log)
    val future = server.start()
    future.join()
    Runtime.getRuntime().addShutdownHook(
        Thread {
            log.info("Closing down...")
            server.close()
            healthService.die()
        }
    )
    consumer.run()
}

private fun buildServer(log: Logger): Server {
    log.info("Loading HTTP Endpoints on port ${config.port}...")
    val sb = Server.builder()
        .http(config.port)
        .service("/") { _, _ -> HttpResponse.of("OK\n") }
    healthService = HealthService()
    sb.annotatedService(healthService)
    return sb.build()
}

The Kafka consumer is implemented simply as:

class Consumer() {
    val cache = Cache()
    val name = "myConsumer"

    fun run() {
        try {
            val pollDuration = config.kafka.pollDurationSeconds
            while (true) {
                val records = consumer.poll(Duration.ofSeconds(pollDuration))
                addToCache(records)
            }
        } catch (e: Exception) {
            // Note: this only catches Exception; OutOfMemoryError is an Error and
            // propagates past this handler.
            log.error("Unexpected event happened. e=$e", e)
        } finally {
            log.info("Closing down $name consumer...")
            consumer.close()
            cache.close()
        }
    }
}

To summarize: an OOM error thrown in consumer.run() crashes the program, but the health endpoint keeps running, so OpenShift still considers the pod healthy and the pod is never restarted.

How do I take down the health endpoint when an OOM error is thrown in consumer.run()?

Edit: adding the Kubernetes configuration

...
readiness:
  path: /health
liveness:
  path: /health
...

Ideally, when your consumer or application runs out of memory, even your health endpoint should stop serving 200/204 responses.
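Since OutOfMemoryError is an Error, not an Exception, the catch (e: Exception) block in Consumer.run() never sees it, so the health state is never flipped. A minimal sketch of one way to do that, assuming the Consumer is handed a reference to the HealthService (the constructor parameter is an assumption, not part of the original code):

class Consumer(private val healthService: HealthService) {
    fun run() {
        try {
            val pollDuration = config.kafka.pollDurationSeconds
            while (true) {
                val records = consumer.poll(Duration.ofSeconds(pollDuration))
                addToCache(records)
            }
        } catch (t: Throwable) {
            // Throwable covers Error as well, so OutOfMemoryError lands here.
            log.error("Fatal event happened. t=$t", t)
        } finally {
            consumer.close()
            cache.close()
            // /health now answers 503, the liveness probe fails, and the pod restarts.
            healthService.die()
        }
    }
}

Keep in mind that code running after an OutOfMemoryError is best-effort: the handler itself may fail to allocate.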

On the Kubernetes side, you have to use a liveness probe to monitor this endpoint.

Once the application goes down and no longer returns 200/204 to K8s, the liveness probe will automatically restart it.
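Alternatively, if you would rather not handle the error in application code, a HotSpot JVM (8u92 and later) can be told to exit the whole process the moment an OutOfMemoryError is thrown, which takes the health endpoint down with it (app.jar stands in for your application jar):

java -XX:+ExitOnOutOfMemoryError -jar app.jar

Once the main process exits, the container is restarted according to the pod's restartPolicy, independent of the probes.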

Here is an example liveness probe in YAML:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: <Docker image>
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

This runs the cat /tmp/healthy check every 5 seconds, per the configured period.
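Since your application already exposes /health over HTTP, an httpGet probe is a closer fit than the exec example above. A sketch, assuming the server listens on port 8080 (substitute your config.port):

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

The kubelet treats any status code from 200 up to but not including 400 as success, so the 503 returned after die() fails the probe and triggers a restart.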
