prometheus-blackbox-exporter fires false-positive alerts



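We have set up a complete Prometheus stack (Prometheus, Grafana, Alertmanager, Node Exporter, and Blackbox exporter) in our Kubernetes cluster using the community Helm charts. The monitoring stack is deployed in its own namespace, while our main software, which is composed of microservices, is deployed in the default namespace. Alerting works fine, but the Blackbox exporter is not scraping metrics correctly (I guess) and periodically fires false-positive alerts. We use it to probe the HTTP liveness/readiness endpoints of our microservices.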

The configuration relevant to this issue (in values.yaml) is as follows:

    - alert: InstanceDown
      expr: up == 0
      for: 5m
      annotations:
        title: 'Instance {{ $labels.instance }} down'
        description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
    - alert: ExporterIsDown
      expr: up{job="prometheus-blackbox-exporter"} == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Blackbox exporter is down"
        description: "Blackbox exporter is down or not being scraped correctly"
...
...
...
    extraScrapeConfigs: |
      - job_name: 'prometheus-blackbox-exporter'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - http://service1.default.svc.cluster.local:8082/actuator/health/liveness
            - http://service2.default.svc.cluster.local:8081/actuator/health/liveness
            - http://service3.default.svc.cluster.local:8080/actuator/health/liveness
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: prometheus-blackbox-exporter:9115

Both alerts fire every hour, even though the endpoints are 100% reachable at that time.

We use the default prometheus-blackbox-exporter/values.yaml file:

    config:
      modules:
        http_2xx:
          prober: http
          timeout: 5s
          http:
            valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
            no_follow_redirects: false
            preferred_ip_protocol: "ip4"

The emails look like this:

[5] Firing
Labels
alertname = InstanceDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = critical

The other type of email:

Labels
alertname = ExporterIsDown
instance = http://service1.default.svc.cluster.local:8082/actuator/health/liveness
job = prometheus-blackbox-exporter
severity = warning
Annotations
description = Blackbox exporter is down or not being scraped correctly
summary = Blackbox exporter is down

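Another strange thing I noticed is that I don't see any probe_* metrics in the Prometheus UI, unlike what is shown at https://lapee79.github.io/en/article/monitoring-http-using-blackbox-exporter/. I don't know what we are doing wrong or what we are missing, but receiving hundreds of false-positive emails is really annoying.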

Answering my own question. It turns out I had the wrong value here:

replacement: prometheus-blackbox-exporter:9115

It has to be the actual name of the exporter's Service:

replacement: stage-prometheus-blackbox-exporter:9115

According to the documentation:

    replacement: localhost:9115 # The blackbox exporter's real hostname:port. For Windows and macOS replace with - host.docker.internal:9115

For Kubernetes, it should be the Service name of the blackbox exporter, but this is not well documented. Or at least I did not find it anywhere.
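For reference, a sketch of what the corrected scrape job might look like. It assumes the chart was installed under a Helm release named stage (which is where the stage- prefix comes from) and that Prometheus runs in the same namespace as the exporter; the <namespace> placeholder in the comment is hypothetical. With a hostname that does not resolve, Prometheus cannot reach the exporter at all, so every target reports up == 0 and no probe_* series are stored, which would be consistent with both symptoms in the question.

```yaml
extraScrapeConfigs: |
  - job_name: 'prometheus-blackbox-exporter'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://service1.default.svc.cluster.local:8082/actuator/health/liveness
    relabel_configs:
      # Pass the original target URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself; this must be its Service name.
      # Assumption: Helm release "stage". From another namespace, use
      # stage-prometheus-blackbox-exporter.<namespace>.svc.cluster.local:9115
      - target_label: __address__
        replacement: stage-prometheus-blackbox-exporter:9115
```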

To find the Service name:

kubectl get svc -l app.kubernetes.io/name=prometheus-blackbox-exporter
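One related point: for this job, up only says whether Prometheus managed to scrape the exporter's /probe endpoint, while the result of the probe itself is exposed by the exporter as probe_success. A minimal sketch of a rule that alerts on the probe result instead, where the alert name EndpointProbeFailed and the severity are illustrative choices and not part of the original configuration:

```yaml
- alert: EndpointProbeFailed   # hypothetical name, not from the original rules
  # probe_success is 1 when the http_2xx probe passed and 0 when it failed
  expr: probe_success{job="prometheus-blackbox-exporter"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Probe failed for {{ $labels.instance }}"
    description: "{{ $labels.instance }} has failed the http_2xx probe for more than 5 minutes."
```

Keeping the existing up-based rule alongside this one still makes sense, since it catches the case where the exporter itself is unreachable.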
