Scaling a GKE K8s cluster broke networking



Hi all. When attempting to scale the GKE cluster from 1 node to 3 nodes, running in separate zones (us-central1-a, b, c), the following became apparent:

Pods scheduled on the new nodes cannot access resources on the internet... i.e. they cannot connect to the Stripe API, etc. (possibly kube-dns related; I have not yet tested whether egress traffic that skips the DNS lookup gets out).

Similarly, I cannot route between pods in K8s as expected, i.e. cross-AZ calls may be failing? When testing with OpenVPN, I am unable to connect to pods scheduled on the new nodes.
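One way to tell a DNS failure apart from a general egress failure is to run a throwaway pod pinned to one of the new nodes and try both a hostname lookup and a raw-IP request. This is only a sketch; NEW_NODE_NAME is a placeholder for an actual node name from kubectl get nodes:

```shell
# Hypothetical debug pod pinned to a new node: tests DNS and raw-IP egress separately.
kubectl run nettest --rm -it --restart=Never --image=busybox:1.31 \
  --overrides='{"spec":{"nodeName":"NEW_NODE_NAME"}}' -- sh -c '
    nslookup api.stripe.com || echo "DNS lookup failed";
    wget -q -T 5 -O /dev/null http://1.1.1.1 && echo "raw-IP egress OK" || echo "raw-IP egress failed"'
```

If the raw-IP request succeeds while the lookup fails, the problem is likely kube-dns reachability rather than routing in general.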

Another issue I noticed is that the metrics server appears to be flaky: kubectl top nodes shows unknown for the new nodes.
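The flakiness can sometimes be narrowed down by checking the metrics-server pod itself (a sketch; on GKE metrics-server runs in kube-system, though the exact label may differ between versions):

```shell
# Check whether metrics-server is running and what it has been logging,
# then compare against what the API reports per node.
kubectl -n kube-system get pods -l k8s-app=metrics-server
kubectl -n kube-system logs -l k8s-app=metrics-server --tail=20
kubectl top nodes
```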

Master K8s version at the time of writing: 1.15.11-gke.9

The setup is the following:

VPC-native (alias IP) - disabled
Intranode visibility - disabled

gcloud container clusters describe cluster-1 --zone us-central1-a

clusterIpv4Cidr: 10.8.0.0/14
createTime: '2017-10-14T23:44:43+00:00'
currentMasterVersion: 1.15.11-gke.9
currentNodeCount: 1
currentNodeVersion: 1.15.11-gke.9
endpoint: 35.192.211.67
initialClusterVersion: 1.7.8
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/skilful-frame-180217/zones/us-central1-a/instanceGroupManagers/gke-cluster-1-default-pool-ff24932a-grp
ipAllocationPolicy: {}
labelFingerprint: a9dc16a7
legacyAbac:
  enabled: true
location: us-central1-a
locations:
- us-central1-a
loggingService: none
....
masterAuthorizedNetworksConfig: {}
monitoringService: none
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-2
  ...
nodeIpv4CidrSize: 24
nodePools:
- autoscaling: {}
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    ...
  initialNodeCount: 1
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  status: RUNNING
  version: 1.15.11-gke.9
servicesIpv4Cidr: 10.11.240.0/20
status: RUNNING
subnetwork: default
zone: us-central1-a

The next troubleshooting step is to create a new pool and migrate to it. Perhaps the answer is staring me in the face... could it be nodeIpv4CidrSize /24?
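For the new-pool route, the migration would look roughly like this (a sketch under the assumption that the old pool is default-pool; new-pool and the machine type are placeholders):

```shell
# Create a replacement pool, then cordon and drain the old pool's nodes
# so workloads reschedule onto the new one.
gcloud container node-pools create new-pool --cluster cluster-1 \
  --zone us-central1-a --machine-type n1-standard-2 --num-nodes 3

for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
```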

Thanks!

  • In your question, the description of the cluster has the following network policy:
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
  • I deployed a cluster as similar to yours as possible:
gcloud beta container --project "PROJECT_NAME" clusters create "cluster-1" \
--zone "us-central1-a" \
--no-enable-basic-auth \
--cluster-version "1.15.11-gke.9" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" \
--no-enable-ip-alias \
--network "projects/owilliam/global/networks/default" \
--subnetwork "projects/owilliam/regions/us-central1/subnetworks/default" \
--enable-network-policy \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair
  • After that I got the same configuration as yours; I will point out two parts of it:
addonsConfig:
  networkPolicyConfig: {}
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
networkPolicy:
  enabled: true
  provider: CALICO
...
  • In the comments you mentioned: "in the UI, it says network policy is disabled... is there a command to remove Calico?". I then gave you a command, and you got the error Network Policy Addon is not Enabled.

Which is weird, because it is applied but not enabled. So I DISABLED it on my cluster, and look:

addonsConfig:
  networkPolicyConfig:
    disabled: true
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
nodeConfig:
...
  • networkPolicyConfig: {} became disabled: true, and the networkPolicy section above nodeConfig is now gone. So, I suggest you enable and disable it again to see if it updates the proper resources and fixes your network policy issue. Here is what we will do:

  • If your cluster is not in production, I'd suggest resizing it back to 1 node, making the changes, and scaling it up again; the update will be quicker. If it is in production, leave it as is, but the update might take longer depending on your pod disruption policy. (default-pool is the name of my cluster's pool; I'll resize it for my example:)

$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 1
Do you want to continue (Y/n)?  y
Resizing cluster-1...done.
  • Then enable the network policy add-on itself (this does not activate it, it only makes it available):
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=ENABLED
Updating cluster-1...done.                                                                                                                                                      
  • And we enable (activate) the network policy:
$ gcloud container clusters update cluster-1 --enable-network-policy
Do you want to continue (Y/n)?  y
Updating cluster-1...done.                                                                                                                                                      
  • Now let's undo it:
$ gcloud container clusters update cluster-1 --no-enable-network-policy
Do you want to continue (Y/n)?  y
Updating cluster-1...done.    
  • After disabling it, wait for the pool to be ready, then run the last command:
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=DISABLED
Updating cluster-1...done.
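To confirm each toggle actually landed before moving on, you can query just the relevant fields of the cluster description (a sketch; the --format projection is standard gcloud syntax):

```shell
# Show only the addon state and the active network policy.
gcloud container clusters describe cluster-1 --zone us-central1-a \
  --format="yaml(addonsConfig.networkPolicyConfig, networkPolicy)"
```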
  • If you scaled down, scale it back up:
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 3
Do you want to continue (Y/n)?  y
Resizing cluster-1...done.
  • Finally, check the description again to see that it matches the proper configuration, and test the communication between the pods.
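A quick cross-node check could look like this (a sketch; the pod name and target IP are placeholders taken from the kubectl get pods -o wide output, and it assumes the pod's image ships ping):

```shell
# List pods with their node placement, then exec into a pod on one node
# and ping the IP of a pod scheduled on a different node.
kubectl get pods -o wide
kubectl exec POD_ON_NODE_A -- ping -c 3 POD_IP_ON_NODE_B
```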

Here is the reference for this configuration: Creating a Cluster Network Policy

If you are still facing issues after that, update your question with the latest cluster description and we will dig further.