GKE Autopilot Warden incompatible with CPU resource requests in a Helm chart



I have a private GKE cluster running GKE 1.23 in Autopilot mode, described below. I'm trying to install an application from a vendor's Helm chart and, following their instructions, I use a script like this:

#! /bin/bash
helm repo add safesoftware https://safesoftware.github.io/helm-charts/
helm repo update
tag="2021.2"
version="safesoftware/fmeserver-$tag"
helm upgrade --install \
fmeserver \
$version \
--set fmeserver.image.tag=$tag \
--set deployment.hostname="REDACTED" \
--set deployment.useHostnameIngress=true \
--set deployment.tlsSecretName="my-ssl-cert" \
--namespace ingress-nginx --create-namespace \
#--set resources.core.requests.cpu="500m" \
#--set resources.queue.requests.cpu="500m"

However, I get errors from the GKE Warden!

Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "ingress-nginx" chart repository
...Successfully got an update from the "safesoftware" chart repository
Update Complete. ⎈Happy Helming!⎈
W1201 10:25:08.117532   29886 warnings.go:70] Autopilot increased resource requests for Deployment ingress-nginx/engine-standard-group to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.201656   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/fmeserver-postgresql to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.304755   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/core to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.392965   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/queue to meet requirements. See http://g.co/gke/autopilot-resources.
W1201 10:25:08.480421   29886 warnings.go:70] Autopilot increased resource requests for StatefulSet ingress-nginx/websocket to meet requirements. See http://g.co/gke/autopilot-resources.
Error: UPGRADE FAILED: cannot patch "core" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'core' cpu requests '{{400 -3} {<nil>}  DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {<nil>} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]} && cannot patch "queue" with kind StatefulSet: admission webhook "gkepolicy.common-webhooks.networking.gke.io" denied the request: GKE Warden rejected the request because it violates one or more policies: {"[denied by autogke-pod-limit-constraints]":["workload 'queue' cpu requests '{{250 -3} {<nil>}  DecimalSI}' is lower than the Autopilot minimum required of '{{500 -3} {<nil>} 500m DecimalSI}' for using pod anti affinity. Requested by user: 'REDACTED', groups: 'system:authenticated'."]}

So I changed the CPU requests of the offending pods in the resource spec; one way to do that is to uncomment the last two lines of the script:

--set resources.core.requests.cpu="500m" \
--set resources.queue.requests.cpu="500m"

This lets me install or upgrade the chart, but then I get PodUnschedulable with the reason Cannot schedule pods: Insufficient cpu. Depending on the exact changes to the chart, I sometimes also see Cannot schedule pods: node(s) had volume node affinity conflict.

I can't see how to increase the number or size of the (e2-medium) nodes in Autopilot mode, and I can't find a way to get rid of the Warden policies either. I checked quotas and found no quota problems. I can install other workloads, including ingress-nginx.

I'm not sure what the problem is; I'm not a Helm or Kubernetes expert.

For reference, the cluster can be described as follows:

addonsConfig:
  cloudRunConfig:
    disabled: true
    loadBalancerType: LOAD_BALANCER_TYPE_EXTERNAL
  configConnectorConfig: {}
  dnsCacheConfig:
    enabled: true
  gcePersistentDiskCsiDriverConfig:
    enabled: true
  gcpFilestoreCsiDriverConfig:
    enabled: true
  gkeBackupAgentConfig: {}
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig:
    disabled: true
autopilot:
  enabled: true
autoscaling:
  autoprovisioningNodePoolDefaults:
    imageType: COS_CONTAINERD
    management:
      autoRepair: true
      autoUpgrade: true
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    upgradeSettings:
      maxSurge: 1
      strategy: SURGE
  autoscalingProfile: OPTIMIZE_UTILIZATION
  enableNodeAutoprovisioning: true
  resourceLimits:
  - maximum: '1000000000'
    resourceType: cpu
  - maximum: '1000000000'
    resourceType: memory
  - maximum: '1000000000'
    resourceType: nvidia-tesla-t4
  - maximum: '1000000000'
    resourceType: nvidia-tesla-a100
binaryAuthorization: {}
clusterIpv4Cidr: 10.102.0.0/21
createTime: '2022-11-30T04:47:19+00:00'
currentMasterVersion: 1.23.12-gke.100
currentNodeCount: 7
currentNodeVersion: 1.23.12-gke.100
databaseEncryption:
  state: DECRYPTED
defaultMaxPodsConstraint:
  maxPodsPerNode: '110'
endpoint: REDACTED
id: REDACTED
initialClusterVersion: 1.23.12-gke.100
initialNodeCount: 1
instanceGroupUrls: REDACTED
ipAllocationPolicy:
  clusterIpv4Cidr: 10.102.0.0/21
  clusterIpv4CidrBlock: 10.102.0.0/21
  clusterSecondaryRangeName: pods
  servicesIpv4Cidr: 10.103.0.0/24
  servicesIpv4CidrBlock: 10.103.0.0/24
  servicesSecondaryRangeName: services
  stackType: IPV4
  useIpAliases: true
labelFingerprint: '05525394'
legacyAbac: {}
location: europe-west3
locations:
- europe-west3-c
- europe-west3-a
- europe-west3-b
loggingConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
    - WORKLOADS
loggingService: logging.googleapis.com/kubernetes
maintenancePolicy:
  resourceVersion: 93731cbd
  window:
    dailyMaintenanceWindow:
      duration: PT4H0M0S
      startTime: 03:00
masterAuth:
masterAuthorizedNetworksConfig:
  cidrBlocks:
  enabled: true
monitoringConfig:
  componentConfig:
    enableComponents:
    - SYSTEM_COMPONENTS
monitoringService: monitoring.googleapis.com/kubernetes
name: gis-cluster-uat
network: geo-nw-uat
networkConfig:
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS_CONTAINERD
  machineType: e2-medium
  metadata:
    disable-legacy-endpoints: 'true'
  oauthScopes:
  - https://www.googleapis.com/auth/devstorage.read_only
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/monitoring
  - https://www.googleapis.com/auth/service.management.readonly
  - https://www.googleapis.com/auth/servicecontrol
  - https://www.googleapis.com/auth/trace.append
  serviceAccount: default
  shieldedInstanceConfig:
    enableIntegrityMonitoring: true
    enableSecureBoot: true
  workloadMetadataConfig:
    mode: GKE_METADATA
nodePoolAutoConfig: {}
nodePoolDefaults:
  nodeConfigDefaults:
    loggingConfig:
      variantConfig:
        variant: DEFAULT
nodePools:
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-medium
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  initialNodeCount: 1
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: default-pool
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
- autoscaling:
    autoprovisioned: true
    enabled: true
    maxNodeCount: 1000
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS_CONTAINERD
    machineType: e2-standard-2
    metadata:
      disable-legacy-endpoints: 'true'
    oauthScopes:
    - https://www.googleapis.com/auth/devstorage.read_only
    - https://www.googleapis.com/auth/logging.write
    - https://www.googleapis.com/auth/monitoring
    - https://www.googleapis.com/auth/service.management.readonly
    - https://www.googleapis.com/auth/servicecontrol
    - https://www.googleapis.com/auth/trace.append
    reservationAffinity:
      consumeReservationType: NO_RESERVATION
    serviceAccount: default
    shieldedInstanceConfig:
      enableIntegrityMonitoring: true
      enableSecureBoot: true
    workloadMetadataConfig:
      mode: GKE_METADATA
  instanceGroupUrls:
  locations:
  management:
    autoRepair: true
    autoUpgrade: true
  maxPodsConstraint:
    maxPodsPerNode: '32'
  name: nap-1rrw9gqf
  networkConfig:
    podIpv4CidrBlock: 10.102.0.0/21
    podRange: pods
  podIpv4CidrSize: 26
  selfLink: REDACTED
  status: RUNNING
  upgradeSettings:
    maxSurge: 1
    strategy: SURGE
  version: 1.23.12-gke.100
notificationConfig:
  pubsub: {}
privateClusterConfig:
  enablePrivateNodes: true
  masterGlobalAccessConfig:
    enabled: true
  masterIpv4CidrBlock: 192.168.0.0/28
  peeringName: gke-nf69df7b6242412e9932-582a-f600-peer
  privateEndpoint: 192.168.0.2
  publicEndpoint: REDACTED
releaseChannel:
  channel: REGULAR
resourceLabels:
  environment: uat
selfLink: REDACTED
servicesIpv4Cidr: 10.103.0.0/24
shieldedNodes:
  enabled: true
status: RUNNING
subnetwork: redacted
verticalPodAutoscaling:
  enabled: true
workloadIdentityConfig:
  workloadPool: REDACTED
zone: europe-west3

Edit: adding the pod describe output.

kubectl describe pod core -n ingress-nginx

...
Events:
Type     Reason     Age                        From     Message
----     ------     ----                       ----     -------
Warning  Unhealthy  6m49s (x86815 over 3d22h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
Warning  BackOff    110s (x13994 over 3d23h)   kubelet  Back-off restarting failed container

kubectl describe pod queue -n ingress-nginx

...
Events:
Type     Reason             Age                        From                                   Message
----     ------             ----                       ----                                   -------
Normal   NotTriggerScaleUp  9m29s (x18130 over 2d14h)  cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match pod affinity rules, 3 node(s) had volume node affinity conflict
Normal   NotTriggerScaleUp  4m28s (x24992 over 2d14h)  cluster-autoscaler                     pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 node(s) had volume node affinity conflict, 2 node(s) didn't match pod affinity rules
Warning  FailedScheduling   3m33s (x3385 over 2d14h)   gke.io/optimize-utilization-scheduler  0/7 nodes are available: 1 node(s) had volume node affinity conflict, 6 Insufficient cpu.

After a while, I resolved these scheduling problems with the following strategies.

If you see:

Cannot schedule pods: Insufficient cpu. 

it means you need to set the pods' CPU requests high enough to meet the Autopilot minimums.
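
As a concrete illustration, this is roughly what a pod spec that satisfies the policy looks like (names and image are placeholders, not from the vendor chart; the 500m figure is the minimum the Warden error above cites for pods that use pod anti-affinity):

apiVersion: v1
kind: Pod
metadata:
  name: example                  # placeholder name
spec:
  containers:
  - name: app                    # placeholder name
    image: example-image:latest  # placeholder image
    resources:
      requests:
        cpu: 500m       # at or above the Autopilot minimum cited in the Warden error
        memory: 512Mi   # illustrative; Autopilot may raise this to fit its CPU:memory ratio

With a Helm chart you normally set this through the chart's values, which is what the two --set flags above do.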

If you cannot find CPU settings that work for your deployment, consider changing the pods' compute class to Balanced.
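
Here is a minimal sketch of how the Balanced compute class is requested, assuming you can get a nodeSelector into the chart's pod templates (the label key is the standard Autopilot one; I have not checked whether the vendor chart exposes a value for it):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example                      # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      nodeSelector:
        cloud.google.com/compute-class: Balanced   # ask Autopilot for the Balanced class
      containers:
      - name: app                    # placeholder name
        image: example-image:latest  # placeholder image
        resources:
          requests:
            cpu: 500m
            memory: 2Gi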

If you see:

volume node affinity conflict, 

Keep in mind that Autopilot clusters are regional (not zonal), while most storage types are either zonal or, when redundant, replicated across only two zones. Your region may have more than two zones, with a pod in each zone that needs the storage. To get around this, I set up NFS (Google Filestore), which is very expensive. Another approach is to configure the deployment so that pods are only scheduled in the zone(s) where the zonal storage lives; this reduces redundancy and lowers cost.
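
A minimal sketch of that zone-pinning alternative is below. The zone is an assumption; the real one is in the PersistentVolume's node affinity (kubectl get pv <pv-name> -o yaml). For a chart-managed StatefulSet the same nodeSelector has to end up in the pod template.

apiVersion: v1
kind: Pod
metadata:
  name: example-zonal              # placeholder name
spec:
  nodeSelector:
    topology.kubernetes.io/zone: europe-west3-a   # assumed zone of the zonal PersistentDisk
  containers:
  - name: app                      # placeholder name
    image: example-image:latest    # placeholder image
    resources:
      requests:
        cpu: 500m
        memory: 2Gi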
