After deploying prometheus-operator by following the documentation, I found that kubectl top nodes does not work:
$ kubectl get apiService v1beta1.metrics.k8s.io
v1beta1.metrics.k8s.io monitoring/prometheus-adapter False (FailedDiscoveryCheck) 44m
$ kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1"
Error from server (ServiceUnavailable): the server is currently unable to handle the request
prometheus-adapter.yaml
...
- args:
  - --cert-dir=/var/run/serving-cert
  - --config=/etc/adapter/config.yaml
  - --logtostderr=true
  - --metrics-relist-interval=1m
  - --prometheus-url=http://prometheus-k8s.monitoring.svc.cluster.local:9090/prometheus
  - --secure-port=6443
...
While looking into the problem, I found a workaround (#1060): adding hostNetwork: true to the configuration file. The APIService then became available, but kubectl top nodes still does not work:
$ kubectl get apiService v1beta1.metrics.k8s.io
v1beta1.metrics.k8s.io monitoring/prometheus-adapter True 64m
$ kubectl top nodes
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get nodes.metrics.k8s.io)
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1"
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"metrics.k8s.io/v1beta1","resources":[{"name":"nodes","singularName":"","namespaced":false,"kind":"NodeMetrics","verbs":["get","list"]},{"name":"pods","singularName":"","namespaced":true,"kind":"PodMetrics","verbs":["get","list"]}]}
Checking the prometheus-adapter logs:
E0812 10:03:02.469561 1 provider.go:265] failed querying node metrics: unable to fetch node CPU metrics: unable to execute query: Get "http://prometheus-k8s.monitoring.svc.cluster.local:9090/prometheus/api/v1/query?query=sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B60s%5D%0A++%29%0A++%2A+on%28namespace%2C+pod%29+group_left%28node%29+%28%0A++++node_namespace_pod%3Akube_pod_info%3A%7Bnode%3D~%22node02.whisper-tech.net%7Cnode03.whisper-tech.net%22%7D%0A++%29%0A%29%0Aor+sum+by+%28node%29+%28%0A++1+-+irate%28%0A++++windows_cpu_time_total%7Bmode%3D%22idle%22%2C+job%3D%22windows-exporter%22%2Cnode%3D~%22node02.whisper-tech.net%7Cnode03.whisper-tech.net%22%7D%5B4m%5D%0A++%29%0A%29%0A&time=1628762582.467": dial tcp: lookup prometheus-k8s.monitoring.svc.cluster.local on 100.100.2.136:53: no such host
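The cause is the hostNetwork: true that was added to prometheus-adapter. With it, the pod sends DNS lookups to the node's resolver (100.100.2.136:53 in the log) instead of CoreDNS, so it cannot resolve the in-cluster prometheus-k8s service.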
One idea I had is to let the Kubernetes nodes themselves resolve cluster-internal names through CoreDNS. Is there a better way to solve this problem? What should I do?
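Your pod is running with hostNetwork, so you should explicitly set its DNS policy to "ClusterFirstWithHostNet", as described in the Pod's DNS Policy documentation:
"ClusterFirstWithHostNet": For Pods running with hostNetwork, you should explicitly set its DNS policy "ClusterFirstWithHostNet".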
I've created a simple example to illustrate how it works.
First, I created the app-1 Pod with hostNetwork: true:
$ cat app-1.yml
kind: Pod
apiVersion: v1
metadata:
  name: app-1
spec:
  hostNetwork: true
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
    - sleep
    - "3600"
$ kubectl apply -f app-1.yml
pod/app-1 created
We can verify that app-1 cannot resolve, for example, kubernetes.default.svc:
$ kubectl exec -it app-1 -- sh
/ # nslookup kubernetes.default.svc
Server: 169.254.169.254
Address: 169.254.169.254#53
** server can't find kubernetes.default.svc: NXDOMAIN
Let's add dnsPolicy: ClusterFirstWithHostNet to the app-1 Pod and recreate it:
$ cat app-1.yml
kind: Pod
apiVersion: v1
metadata:
  name: app-1
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: dnsutils
    image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
    command:
    - sleep
    - "3600"
$ kubectl delete pod app-1 && kubectl apply -f app-1.yml
pod "app-1" deleted
pod/app-1 created
Finally, we can check that the app-1 Pod is able to resolve kubernetes.default.svc:
$ kubectl exec -it app-1 -- sh
/ # nslookup kubernetes.default.svc
Server: 10.8.0.10
Address: 10.8.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.8.0.1
As you can see in the example above, everything works as expected with the ClusterFirstWithHostNet dnsPolicy.
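Applied to the prometheus-adapter from the question, the same change might look like the fragment below. This is only a sketch: it assumes prometheus-adapter.yaml is a Deployment whose pod spec sits under spec.template.spec and that the hostNetwork: true workaround is kept. All other fields of the manifest are omitted here and should stay as they are.

```yaml
# Fragment of prometheus-adapter.yaml (assumed to be a Deployment); merge into the existing manifest.
spec:
  template:
    spec:
      # Kept from the workaround described in the question.
      hostNetwork: true
      # Resolve names through the cluster DNS (CoreDNS) even though the pod uses the node's network.
      dnsPolicy: ClusterFirstWithHostNet
```

After re-applying the manifest, the adapter pod should be able to resolve prometheus-k8s.monitoring.svc.cluster.local, and kubectl top nodes can be retried.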
For more information, see the DNS for Services and Pods documentation.