We upgraded our Kubernetes cluster (running on GKE) from version 1.19 to 1.21, and since then we have been unable to roll out our deployments. The relevant parts of the deployment are defined as follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    name: my-deployment
spec:
  replicas: 2
  revisionHistoryLimit: 10
  strategy:
    type: "RollingUpdate"
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      name: "my-deployment"
  template:
    metadata:
      labels:
        name: "my-deployment"
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: name
                    operator: In
                    values:
                      - my-deployment
                      - my-other-deployment
              topologyKey: "kubernetes.io/hostname"
      nodeSelector:
        cloud.google.com/gke-nodepool: somenodepool
...
We are running a 5-node cluster, and "my-other-deployment" has only a single pod replica, so before the rollout starts there should be two nodes available on which the new "my-deployment" pod can be scheduled. This has worked fine for years, but after upgrading the cluster to v1.21.10-gke.2000 the rollout now fails:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 50s (x2 over 52s) default-scheduler 0/5 nodes are available: 1 Insufficient cpu, 1 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match pod anti-affinity rules, 4 node(s) didn't match pod affinity/anti-affinity rules.
Normal NotTriggerScaleUp 50s cluster-autoscaler pod didn't trigger scale-up:
Normal Scheduled 20s default-scheduler Successfully assigned default/my-deployment-7f66984b9f-bqs8l to gke-v1-21-10-gke-2000-n1-standar-9b2c965a-lz4j
Normal Pulled 19s kubelet Container image "somerepo/something/my-deployment:589" already present on machine
Normal Created 19s kubelet Created container my-deployment
Normal Started 19s kubelet Started container my-deployment
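To verify the assumption above that two nodes are still free of pods matched by the anti-affinity rule, pod placement can be inspected with something like the following (a sketch, assuming everything runs in the default namespace):

# Show which node each pod matched by the anti-affinity label selector is on
kubectl get pods -o wide -l 'name in (my-deployment,my-other-deployment)' \
  --sort-by=.spec.nodeName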
What could be causing this, and how can we fix it?
I'm not aware of anything that changed between 1.19 and 1.21 with respect to (anti-)affinity. Maybe check the following (commands to verify each point are sketched below):
- Is there another workload carrying the same name label that is triggering the anti-affinity rule?
- Is the node pool name in the nodeSelector correct?
- Are all nodes in the node pool schedulable?
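A minimal sketch of how each point could be checked with kubectl; the label key and the node pool label are taken from the manifest above:

# 1. Other pods carrying the labels matched by the anti-affinity rule,
#    across all namespaces:
kubectl get pods --all-namespaces -o wide \
  -l 'name in (my-deployment,my-other-deployment)'

# 2. The node pool label on each node, to compare against the nodeSelector:
kubectl get nodes -L cloud.google.com/gke-nodepool

# 3. Nodes that are cordoned (SchedulingDisabled) or tainted:
kubectl get nodes
kubectl describe nodes | grep -i taints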
The problem was that there was not enough CPU available on the remaining nodes to satisfy the pod's CPU resource requests. The way this is enforced may have changed in 1.20 or 1.21, since it was not a problem before.
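A quick way to confirm this is to compare each node's allocatable CPU with the CPU requests already committed on it; kubectl describe prints this in the "Allocated resources" section (a sketch, the node name below is the one from the events above):

# Allocatable CPU vs. already-requested CPU, per node:
kubectl describe nodes | grep -A 8 "Allocated resources"

# Or for the single node the pod was eventually scheduled on:
kubectl describe node gke-v1-21-10-gke-2000-n1-standar-9b2c965a-lz4j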