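In the context of SRE, what does symptom-based versus cause-based monitoring mean? Why is it important? Which tools are used for this kind of monitoring?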
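Symptoms versus causes
Your monitoring system should address two questions: what is broken, and why?
The "what is broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. The table below lists some hypothetical symptoms and their corresponding causes.
Distinguishing "what" from "why" is key to writing good monitoring with maximum signal and minimum noise.
Example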
+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
| Symptom | Cause |
+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
| I’m serving HTTP 500s or 404s | Database servers are refusing connections |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| My responses are slow | CPUs are overloaded by a bogosort, or an Ethernet cable is crimped under a rack, visible as partial packet loss |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Users in Antarctica aren’t receiving animated cat GIFs | Your Content Distribution Network hates scientists and felines, and thus blacklisted some client IPs |
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| Private content is world-readable | A new software push caused ACLs to be forgotten and allowed all requests |
+--------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
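To make the distinction concrete, here is a minimal Python sketch of how an alerting layer might treat the two kinds of signal differently. It assumes a common convention that is not stated in the text above: page a human only when a user-visible symptom fires, and attach any firing cause signals as debugging hints. The `Signal` class, the `triage` function, and the signal names are all hypothetical and are not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """A single monitored condition (hypothetical structure for illustration)."""
    name: str
    kind: str     # "symptom" (user-visible) or "cause" (internal)
    firing: bool  # True when the condition is currently violated


def triage(signals):
    """Page only on firing symptoms; report firing causes as diagnostic hints."""
    symptoms = [s.name for s in signals if s.firing and s.kind == "symptom"]
    causes = [s.name for s in signals if s.firing and s.kind == "cause"]
    return {
        "page": bool(symptoms),       # "what is broken" decides whether to page
        "symptoms": symptoms,
        "possible_causes": causes,    # "why" helps whoever responds
    }


# Example loosely modeled on the first table row (names are made up).
signals = [
    Signal("http_5xx_rate_high", "symptom", True),
    Signal("db_connections_refused", "cause", True),
    Signal("cpu_overloaded", "cause", False),
]
report = triage(signals)
print(report)
```

In this sketch a firing cause with no firing symptom does not page anyone, which is one way to keep noise down; whether that trade-off suits a given system is a judgment call.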
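The tools used for monitoring depend on your platform, on what you want to monitor, and on how you want to monitor it. For example, Azure Monitor covers applications and infrastructure hosted in Azure, Amazon CloudWatch covers those hosted in AWS, and the list goes on.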