Prometheus: how do I disable one rule for one specific job_name?



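I'm setting up Prometheus alerts for two Elasticsearch clusters (using elasticsearch_exporter), one with 8 nodes and one with 3 nodes. What I want is an alert whenever either cluster loses a node, but right now every rule applies to both clusters, so that isn't possible.

prometheus.yml file: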

global:
  scrape_interval: 10s
rule_files:
  - alert.rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: cluster1
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
      - targets: ['xxx1:9114']
        labels:
          service: cluster1
  - job_name: cluster2
    scrape_interval: 30s
    scrape_timeout: 30s
    metrics_path: "/metrics"
    static_configs:
      - targets: ['xxx2:9114']
        labels:
          service: cluster2

alert.rules.yml file:

groups:
  - name: alert.rules
    rules:
      - alert: ElasticsearchLostNode
        expr: elasticsearch_cluster_health_number_of_nodes < 8
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Elasticsearch Healthy Nodes (instance {{ $labels.instance }})
          description: Number Healthy Nodes less than 8
      ...

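Of course, number_of_nodes < 8 is always true for the small cluster, and if I set the threshold to < 3, the alert does not fire when the large cluster loses one node.

Is there a way to exempt one specific job_name from one specific rule, or to define rules A that apply only to job_name A and rules B that apply only to job_name B?

Yes, you can create one rule per job in the alert.rules.yml file, restricting each expression with a job label matcher: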

groups:
  - name: alert.rules
    rules:
      - alert: ElasticsearchLostNode1
        expr: elasticsearch_cluster_health_number_of_nodes{job="cluster1"} < 8
        ...
      - alert: ElasticsearchLostNode2
        expr: elasticsearch_cluster_health_number_of_nodes{job="cluster2"} < 3
        ...
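
As a variation that is not part of the original answer, the two per-job expressions could probably also be combined into a single alert with the PromQL `or` operator, so that one alert name covers both clusters. The sketch below assumes the job names and thresholds from the question, and that the `for` and `severity` values from the original rule are kept. The annotation wording is illustrative only.

```yaml
groups:
  - name: alert.rules
    rules:
      # One alert covering both clusters; each branch of the "or"
      # only matches series scraped by the named job.
      - alert: ElasticsearchLostNode
        expr: |
          elasticsearch_cluster_health_number_of_nodes{job="cluster1"} < 8
          or
          elasticsearch_cluster_health_number_of_nodes{job="cluster2"} < 3
        for: 1m
        labels:
          severity: warning
        annotations:
          # Illustrative wording (assumption); $labels.job identifies the cluster.
          summary: "Elasticsearch lost a node (job {{ $labels.job }})"
          description: "Current node count: {{ $value }}"
```

Separate rules per job, as in the answer above, remain the simpler choice when the clusters need different severities or `for` durations.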
