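I have set up my Kubernetes cluster on AWS, and I am trying to monitor several pods using cAdvisor + Prometheus + Alertmanager. What I want is to trigger an email alert (including the service/container name) if a container/pod goes down, gets stuck in an Error or CrashLoopBackOff state, or is stuck in any state other than Running.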
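Prometheus collects a wide range of metrics. For example, you can use the metric kube_pod_container_status_restarts_total (exposed by kube-state-metrics) to monitor restarts, which will reflect your problem.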
It carries labels that you can use in the alert:
- container=<container-name>
- namespace=<pod-namespace>
- pod=<pod-name>
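So all you need to do is configure alertmanager.yaml with the correct SMTP settings and a receiver, and add an alerting rule (Prometheus loads rules from its own rule file).
Config: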
global:
  # The smarthost and SMTP sender used for mail notifications.
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
receivers:
  - name: 'team-X-mails'
    email_configs:
      - to: 'team-X+alerts@example.org'
# Only one default receiver
route:
  receiver: team-X-mails
# Example group with one alert
# (this belongs in a Prometheus rule file, not in alertmanager.yaml)
groups:
  - name: example-alert
    rules:
      # Alert about restarts; the labels are namespace, pod and container
      - alert: RestartAlerts
        expr: sum by (namespace, pod, container) (kube_pod_container_status_restarts_total) > 5
        for: 10m
        annotations:
          summary: "More than 5 restarts in pod {{ $labels.pod }}"
          description: "{{ $labels.container }} restarted {{ $value }} times in pod {{ $labels.namespace }}/{{ $labels.pod }}"
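For the rule above to fire and reach the email receiver, Prometheus has to load the rule file and know where Alertmanager is listening. A minimal sketch of the relevant prometheus.yml fragment follows; the rule file name and the Alertmanager address are assumptions for illustration and should be replaced with your own values.

```yaml
# prometheus.yml (fragment)
rule_files:
  - "alert.rules.yml"        # assumed file name for the rule group above

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"   # assumed address; 9093 is Alertmanager's default port
```

Because the rule keeps the namespace, pod and container labels, they should also be available to the email notification that Alertmanager sends.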
I am using this:
- alert: PodCrashLooping
  annotations:
    description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes.
    summary: Pod is crash looping.
  expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0
  for: 5m
  labels:
    severity: critical
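Restart counts only cover containers that keep restarting. The question also asks about pods stuck in CrashLoopBackOff or in any state other than Running, which could be covered with the state metrics from kube-state-metrics. The following is a sketch, assuming your kube-state-metrics version exposes kube_pod_container_status_waiting_reason and kube_pod_status_phase; the group name, alert names, durations and severities are hypothetical and meant to be tuned.

```yaml
groups:
  - name: pod-state-alerts            # hypothetical group name
    rules:
      # Container is waiting with reason CrashLoopBackOff
      - alert: ContainerCrashLoopBackOff
        expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.container }} is in CrashLoopBackOff"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) has been in CrashLoopBackOff for 5 minutes."
      # Pod phase is something other than Running or Succeeded
      - alert: PodNotRunning
        expr: kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is in phase {{ $labels.phase }}"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} has been in phase {{ $labels.phase }} for 10 minutes."
```

The `for` clause keeps short-lived states, such as a pod that is briefly Pending while it is being scheduled, from sending an email.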