Running a kubernetes kubeadm cluster behind a corporate firewall/proxy server

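We have a 5-node cluster that was moved behind our corporate firewall/proxy server.

Following the directions here: Setting up standalone Kubernetes cluster behind corporate proxy

I set the proxy server environment variables using the following: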

export http_proxy=http://proxy-host:proxy-port/
export HTTP_PROXY=$http_proxy
export https_proxy=$http_proxy
export HTTPS_PROXY=$http_proxy
printf -v lan '%s,' localip_of_machine
printf -v pool '%s,' 192.168.0.{1..253}
printf -v service '%s,' 10.96.0.{1..253}
export no_proxy="${lan%,},${service%,},${pool%,},127.0.0.1";
export NO_PROXY=$no_proxy

Now everything in our cluster works internally. However, when I try to create a pod that pulls an image from outside, the pod gets stuck on ContainerCreating, e.g.,

[gms@thalia0 ~]$ kubectl apply -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created

and it is stuck here:

[gms@thalia0 ~]$ kubectl get pods
NAME                            READY   STATUS              RESTARTS   AGE
busybox                         0/1     ContainerCreating   0          17m

I assume this is because the host/domain the images are pulled from is not in our corporate proxy rules. We do have rules for

k8s.io
kubernetes.io
docker.io
docker.com

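so I am not sure which other hosts/domains need to be added.

I ran a describe pods for busybox and saw references to node.kubernetes.io (I am putting in a domain-wide exception for *.kubernetes.io, which will hopefully suffice).

This is what I get from kubectl describe pods busybox: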

Volumes:
  default-token-2kfbw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-2kfbw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age   From                          Message
  ----     ------                  ----  ----                          -------
  Normal   Scheduled               73s   default-scheduler             Successfully assigned default/busybox to thalia3.ahc.umn.edu
  Warning  FailedCreatePodSandBox  10s   kubelet, thalia3.ahc.umn.edu  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to set up pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "6af48c5dadf6937f9747943603a3951bfaf25fe1e714cb0b0cbd4ff2d59aa918" network for pod "busybox": NetworkPlugin cni failed to teardown pod "busybox_default" network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]
  Normal   SandboxChanged          10s   kubelet, thalia3.ahc.umn.edu  Pod sandbox changed, it will be killed and re-created.

I assume the calico error is due to this:

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

The calico and coredns pods seem to have similar errors reaching node.kubernetes.io, so I assume this is because our servers are unable to pull down new images on a restart.

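It looks like you are misunderstanding a few Kubernetes concepts that I'd like to help clarify here. References to node.kubernetes.io are not attempts to make network calls to that domain. It is simply the convention Kubernetes uses for string keys: a DNS-style prefix, a slash, and a name. So if you ever need to apply labels, annotations, or tolerations, you can define your own keys in the same way, such as subdomain.domain.tld/some-key.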
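To make the point about keys concrete, here is a minimal sketch that splits such a key with plain string handling. The helper names key_prefix and key_name are made up for this illustration; nothing in it resolves or contacts the domain-looking prefix.

```shell
#!/usr/bin/env bash
# Split a Kubernetes-style key of the form "prefix/name" into its parts.
# The prefix looks like a DNS name, but it only namespaces the key.
# key_prefix and key_name are hypothetical helpers for illustration.
key_prefix() {
  case $1 in
    */*) printf '%s\n' "${1%%/*}" ;;  # text before the first slash
    *)   printf '\n' ;;               # no prefix present
  esac
}

key_name() {
  printf '%s\n' "${1#*/}"             # text after the first slash
}

key="node.kubernetes.io/not-ready"
echo "prefix: $(key_prefix "$key")"
echo "name:   $(key_name "$key")"
```

A custom key is used the same way, for instance kubectl label node <node-name> example.com/some-key=some-value, where example.com/some-key is a hypothetical key chosen only for this example.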

As for the Calico issue you are experiencing, it looks like the error:

network: error getting ClusterInformation: Get https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default: dial tcp 10.96.0.1:443: i/o timeout]

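is our culprit here. 10.96.0.1 is the IP address used to reach the Kubernetes API server from within pods. It seems that the calico/node pod running on your node is failing to reach the API server. Could you provide more context on how you set up Calico? Do you know which version of Calico you are running?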
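One way to rule the proxy settings in or out is to confirm that 10.96.0.1 really appears in the no_proxy list. The sketch below assumes a comma-separated list of literal addresses like the one built in the question. The function name in_no_proxy and the sample list are made up for this example, and it only checks exact entries, not CIDR ranges or domain suffixes.

```shell
#!/usr/bin/env bash
# Return 0 if the address appears verbatim as an entry in a comma-separated
# no_proxy style list, 1 otherwise. Falls back to $no_proxy when no list is
# passed. in_no_proxy is a hypothetical helper, not part of any tool.
in_no_proxy() {
  local addr=$1 list=${2-${no_proxy-}} entry
  local -a entries
  IFS=',' read -r -a entries <<< "$list"
  for entry in "${entries[@]}"; do
    [[ $entry == "$addr" ]] && return 0
  done
  return 1
}

# Sample list shaped like the one in the question (shortened for illustration).
sample_list="192.168.0.1,10.96.0.1,10.96.0.2,127.0.0.1"
if in_no_proxy 10.96.0.1 "$sample_list"; then
  echo "10.96.0.1 bypasses the proxy"
else
  echo "10.96.0.1 would be sent to the proxy"
fi
```

On the affected node, and assuming curl is installed, curl -k --noproxy '*' --max-time 5 https://10.96.0.1:443/version tests the path directly. Any HTTP response, even an authorization error, shows that the address is reachable, while a timeout reproduces the problem independently of Calico.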

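The fact that your calico/node instance is trying to access the crd.projectcalico.org/v1/clusterinformations resource tells me that it is using the Kubernetes datastore as its backend. Are you sure you are not trying to run Calico in etcd mode?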

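There doesn't seem to be any problem pulling the image; if there were, you would see an ImagePullBackOff status. (Although that may still show up later, after the error message you are seeing.)

The error you are seeing from your pods is related to them not being able to connect to the kube-apiserver internally. It looks like a timeout, so most likely something is wrong with the kubernetes service in your default namespace. You can check it like this, for example: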

$ kubectl -n default get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   2d20h

It could be that it is missing(?). You can always re-create it:

$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  labels:
    component: apiserver
    provider: kubernetes
  name: kubernetes
  namespace: default
spec:
  clusterIP: 10.96.0.1
  type: ClusterIP
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 443
EOF
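To confirm that the service exists and actually points at an API server, it can help to look at its endpoints as well. The following is a sketch only: the function name check_kubernetes_service is made up for this example, and it assumes kubectl is configured against the cluster in question.

```shell
#!/usr/bin/env bash
# Print the kubernetes service and its endpoints in the default namespace.
# The endpoints should list the API server's real address and port
# (commonly port 6443 on kubeadm clusters; treat that as an assumption).
# check_kubernetes_service is a hypothetical helper for illustration.
check_kubernetes_service() {
  kubectl -n default get svc kubernetes || return 1
  kubectl -n default get endpoints kubernetes || return 1
}

# Only run the check when kubectl is available on this machine.
if command -v kubectl >/dev/null 2>&1; then
  check_kubernetes_service || echo "kubernetes service check failed" >&2
else
  echo "kubectl not found; skipping check" >&2
fi
```

If the endpoints list is empty or shows an address the nodes cannot reach, the timeout on 10.96.0.1 is more likely a routing or API server problem than a missing service.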

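The tolerations are basically saying that the pod can tolerate being scheduled on nodes that have the node.kubernetes.io/not-ready:NoExecute and node.kubernetes.io/unreachable:NoExecute taints, but your error doesn't look related to that.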

This issue normally means that the docker daemon is unable to respond.

It can occur when some other service on the node is consuming a lot of CPU or I/O.
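A quick way to test that theory on a node is to see whether the daemon answers within a time limit. This is a minimal sketch, assuming the docker CLI and the coreutils timeout command are installed. The function name docker_responds and the 10-second limit are arbitrary choices for this example.

```shell
#!/usr/bin/env bash
# Report whether the Docker daemon answers "docker info" within a time limit.
# docker_responds is a hypothetical helper; the default limit is arbitrary.
docker_responds() {
  local limit=${1:-10}
  timeout "$limit" docker info >/dev/null 2>&1
}

if ! command -v docker >/dev/null 2>&1; then
  echo "docker CLI not found; skipping check" >&2
elif docker_responds 10; then
  echo "docker daemon responded"
else
  echo "docker daemon did not respond within 10s" >&2
fi
```

If the daemon is slow to answer, tools such as top or iotop on the same node can show which process is using the CPU or disk.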
