Prometheus does not get the metrics on every scrape, which leaves gaps when the metrics are displayed



I deployed several Flink clusters in standalone mode on K8s and export their metrics through a single Prometheus Pushgateway.

The problem:
The metric data reaches Prometheus only intermittently, so there are gaps between data points when it is displayed in Grafana.

[Graph showing the gaps]


Prometheus target:

monitoring/pushgateway/0 (1/1 up)
Endpoint: http://172.19.88.111:9091/metrics
State   : UP
Labels: endpoint="tcp" instance="172.19.88.111:9091" job="pushgateway" namespace="flink-sql" pod="pushgateway-76d64545dd-6prdn" service="pushgateway"

I queried the Pushgateway directly, but I do not get all the metrics every time:

bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:17 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:18 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:18 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:19 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",instance="",job="model"} 20
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:21 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:22 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:22 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:23 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:23 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",instance="",job="model"} 20
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:24 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:24 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:25 UTC 2021
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:26 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:27 UTC 2021
# HELP flink_jobmanager_numRegisteredTaskManagers numRegisteredTaskManagers (scope: jobmanager)
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
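To put a number on how often each series goes missing, a transcript like the one above can be tallied per sample. The following is a minimal sketch, not part of the original setup: it assumes every sample starts with the `bash-5.0#` prompt, and `TRANSCRIPT` is an abridged excerpt of four samples from the paste (the function names are made up for illustration).

```python
import re
from collections import Counter

PROMPT = "bash-5.0#"  # shell prompt that starts each sample in the paste
# One exposition line of the metric; group 1 captures the job label value.
SERIES = re.compile(
    r'^flink_jobmanager_numRegisteredTaskManagers\{[^}]*\bjob="([^"]*)"[^}]*\}\s+\S+$'
)

# Abridged excerpt of the transcript (samples at 07:15:17, :18, :19, :20).
TRANSCRIPT = '''\
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:17 UTC 2021
# TYPE flink_jobmanager_numRegisteredTaskManagers gauge
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:18 UTC 2021
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:19 UTC 2021
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
bash-5.0# date  &&  curl  -s http://pushgateway.flink-sql:9091/metrics      | grep flink_jobmanager_numRegisteredTaskManagers
Mon May 24 07:15:20 UTC 2021
flink_jobmanager_numRegisteredTaskManagers{host="172_19_90_175",instance="",job="model1122"} 8
flink_jobmanager_numRegisteredTaskManagers{host="flink_jobmanager",instance="",job="flink-sql"} 0
flink_jobmanager_numRegisteredTaskManagers{host="jobmanager",instance="",job="model"} 20
'''


def jobs_per_sample(transcript):
    """Split the transcript at each prompt and return, per sample,
    the set of job labels for which the metric was present."""
    samples = []
    current = None
    for line in transcript.splitlines():
        line = line.strip()
        if line.startswith(PROMPT):
            if current is not None:
                samples.append(frozenset(current))
            current = set()
        elif current is not None:
            match = SERIES.match(line)
            if match:
                current.add(match.group(1))
    if current is not None:
        samples.append(frozenset(current))
    return samples


def presence_counts(samples):
    """Count in how many samples each job label appeared."""
    counts = Counter()
    for jobs in samples:
        counts.update(jobs)
    return counts


samples = jobs_per_sample(TRANSCRIPT)
for job, n in sorted(presence_counts(samples).items()):
    print(f"{job}: present in {n} of {len(samples)} samples")
```

Run against the full transcript, this shows which jobs' JobManager series drop out most often, which is easier to compare across configurations than reading the raw output.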

The configuration in my flink-conf.yaml:

metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.flink-sql
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-sql
metrics.reporter.promgateway.randomJobNameSuffix: false
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 3 SECONDS

Even setting both the Prometheus scrape interval and metrics.reporter.promgateway.interval to 1 second made no difference.
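To see where pushes with this configuration would land, the reporter options can be turned into a Pushgateway URL. This is only a sketch under assumptions: the Pushgateway's push path is `/metrics/job/<job name>`, but that Flink's reporter uses the job name as its only grouping label is assumed here, and the `uuid` suffix is a stand-in for whatever suffix the reporter really generates when `randomJobNameSuffix` is enabled. The helper names are hypothetical.

```python
import uuid
from urllib.parse import quote

PREFIX = "metrics.reporter.promgateway."

# The reporter options from the flink-conf.yaml above.
CONF = """\
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: pushgateway.flink-sql
metrics.reporter.promgateway.port: 9091
metrics.reporter.promgateway.jobName: flink-sql
metrics.reporter.promgateway.randomJobNameSuffix: false
metrics.reporter.promgateway.deleteOnShutdown: false
metrics.reporter.promgateway.interval: 3 SECONDS
"""


def reporter_options(conf_text):
    """Collect the promgateway reporter options from flat 'key: value' lines."""
    options = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip()
        if key.startswith(PREFIX):
            options[key[len(PREFIX):]] = value.strip()
    return options


def push_url(options):
    """Build the Pushgateway URL that a push with these options would target
    (assumption: the job name is the only grouping label)."""
    job = options["jobName"]
    if options.get("randomJobNameSuffix", "true").lower() == "true":
        job += uuid.uuid4().hex  # illustrative stand-in for the real suffix
    return f"http://{options['host']}:{options['port']}/metrics/job/{quote(job, safe='')}"


print(push_url(reporter_options(CONF)))
```

With `randomJobNameSuffix: false`, every process that loads this file resolves to the same URL, and therefore to the same metric group on the Pushgateway.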

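My reasoning:

  • The gaps in the graph mean that Prometheus has not stored continuous data.

  • Prometheus gets its metric data from the Pushgateway.

  • The Pushgateway gets its metric data from the JobManager/TaskManager.

  • The data reported by the JobManager/TaskManager to the Pushgateway is not cached by the Pushgateway.

  • So when Prometheus periodically scrapes the Pushgateway, it only gets whatever data the Pushgateway holds at that moment, not everything the JobManager/TaskManager has reported.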
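The last point can be made concrete with a toy model. The Pushgateway keeps the most recently pushed metrics per grouping key; a PUT replaces the whole group, while a POST replaces only metrics with the same name. If several processes pushed with PUT under the same grouping key, each push would wipe out what the others had sent, and a scrape would see only the last pusher's data. Whether that is what happens in this setup is an assumption, not something established here; the class below is an in-memory sketch, not the real Pushgateway, and the metric values are illustrative.

```python
class GatewaySketch:
    """Toy in-memory model of per-group storage; not the real Pushgateway."""

    def __init__(self):
        # grouping key (tuple of label pairs) -> {metric name: value}
        self.groups = {}

    def put(self, grouping_key, metrics):
        # PUT semantics: the pushed metrics replace the whole group.
        self.groups[grouping_key] = dict(metrics)

    def post(self, grouping_key, metrics):
        # POST semantics: only metrics with the same name are replaced.
        self.groups.setdefault(grouping_key, {}).update(metrics)

    def scrape(self):
        """Return every metric name currently exposed, across all groups."""
        return sorted(name for group in self.groups.values() for name in group)


# Illustrative payloads for one JobManager and one TaskManager.
JM = {"flink_jobmanager_numRegisteredTaskManagers": 8}
TM = {"flink_taskmanager_Status_JVM_CPU_Load": 0.1}

# Both processes push with PUT under the same grouping key.
shared = GatewaySketch()
key = (("job", "flink-sql"),)
shared.put(key, JM)
shared.put(key, TM)  # replaces the JobManager's metrics
print(shared.scrape())

# Each process pushes under its own (hypothetical) grouping key.
separate = GatewaySketch()
separate.put((("job", "flink-sql-jm"),), JM)
separate.put((("job", "flink-sql-tm"),), TM)
print(separate.scrape())
```

In the shared case only the TaskManager metric survives until the JobManager pushes again, so the result of a scrape depends on who pushed last. With separate grouping keys both metrics stay visible.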

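What I have observed seems to fit this, but it is not conclusive. After all, the Pushgateway must serve some purpose. This also does not take into account whether Flink's metric reporter actually reports data periodically as expected.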

I have now solved the gap problem with a different approach that pulls the data directly from the JobManager/TaskManager.
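A quick way to check such a direct setup is to ask each process for the metric and see whether it is there on every attempt. The sketch below assumes each JobManager/TaskManager serves the Prometheus text format at `http://<host>:<port>/metrics` (for example through Flink's PrometheusReporter, whose default port is 9249). The host names in the usage comment are hypothetical, and `fetch` is injectable so the check can be exercised without a network.

```python
from urllib.request import urlopen


def http_fetch(target, timeout=5):
    """Fetch the text exposition from http://<target>/metrics."""
    with urlopen(f"http://{target}/metrics", timeout=timeout) as response:
        return response.read().decode("utf-8")


def has_metric(exposition, name):
    """True if any sample line in the exposition belongs to the metric."""
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        # A sample line is "<name>{labels} value" or "<name> value".
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name == name:
            return True
    return False


def check_targets(targets, name, fetch=http_fetch):
    """Map each target to whether it currently exposes the metric."""
    result = {}
    for target in targets:
        try:
            result[target] = has_metric(fetch(target), name)
        except OSError:
            result[target] = False  # an unreachable target counts as missing
    return result


# Usage (hypothetical host names, PrometheusReporter default port):
# check_targets(
#     ["flink-jobmanager:9249", "flink-taskmanager-0:9249"],
#     "flink_jobmanager_numRegisteredTaskManagers",
# )
```

Because each process answers for its own metrics, a missing series in this check points at that process rather than at an intermediate store.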
