HDFS NameNode not showing the DataNode list correctly on Kubernetes



I am trying to install HDFS on an EKS cluster. I deployed one namenode and two datanodes, and all of them started up successfully.

But a strange problem is occurring. When I check the NameNode GUI or query the dfsadmin client for the list of datanodes, it only ever shows a single datanode, chosen seemingly at random: sometimes datanode-0, sometimes datanode-1. It never shows both/all datanodes.

What could be wrong here? I am even using a headless service for the datanodes. Please help.

#clusterIP service of namenode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 8020
      protocol: TCP
      name: nn-rpc
    - port: 9870
      protocol: TCP
      name: nn-web
  selector:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
  type: ClusterIP
---
#namenode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-name
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-name
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-name
  replicas: 1       #TODO 2 namenodes (1 active, 1 standby)
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-name
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"
    spec:
      initContainers:
        - name: delete-lost-found
          image: busybox
          command: ["sh", "-c", "rm -rf /hadoop/dfs/name/lost+found"]
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
      containers:
        - name: hdfs-name
          image: bde2020/hadoop-namenode
          env:
            - name: CLUSTER_NAME
              value: hdfs-k8s
            - name: HDFS_CONF_dfs_permissions_enabled
              value: "false"
            #- name: HDFS_CONF_dfs_replication              #not needed
            #  value: "2"
          ports:
            - containerPort: 8020
              name: nn-rpc
            - containerPort: 9870
              name: nn-web
          resources:
            limits:
              cpu: "500m"
              memory: 1Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          volumeMounts:
            - name: hdfs-name-pv-claim
              mountPath: /hadoop/dfs/name
  volumeClaimTemplates:
    - metadata:
        name: hdfs-name-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi
---
#headless service of datanode
apiVersion: v1
kind: Service
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  ports:
    - port: 9866
      protocol: TCP
      name: dn-rpc
    - port: 9864
      protocol: TCP
      name: dn-web
  selector:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
  clusterIP: None
  type: ClusterIP
---
#datanode stateful deployment
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-data
  namespace: pulse
  labels:
    app.kubernetes.io/name: hdfs-data
    app.kubernetes.io/version: "1.0"
spec:
  serviceName: hdfs-data
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: hdfs-data
      app.kubernetes.io/version: "1.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-data
        app.kubernetes.io/version: "1.0"
    spec:
      containers:
        - name: hdfs-data
          image: bde2020/hadoop-datanode
          env:
            - name: CORE_CONF_fs_defaultFS
              value: hdfs://hdfs-name:8020
          ports:
            - containerPort: 9866
              name: dn-rpc
            - containerPort: 9864
              name: dn-web
          resources:
            limits:
              cpu: "500m"
              memory: 1Gi
            requests:
              cpu: "500m"
              memory: 1Gi
          volumeMounts:
            - name: hdfs-data-pv-claim
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:
    - metadata:
        name: hdfs-data-pv-claim
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: ebs
        resources:
          requests:
            storage: 1Gi

Running hdfs dfsadmin -report shows only one datanode at a time, seemingly at random, e.g. sometimes datanode-0 and sometimes datanode-1.
The datanode hostnames are different (datanode-0, datanode-1), but they report the same name (127.0.0.1:9866 (localhost)). Could this be the problem? If so, how can it be fixed?
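
For anyone debugging a similar symptom, a related HDFS knob is to make datanodes advertise and be addressed by hostname instead of IP. It did not turn out to be the fix here (see the update below), but for reference, a minimal sketch using the same HDFS_CONF_ env-var convention the bde2020 images already use above (these map to the standard dfs.datanode.use.datanode.hostname and dfs.client.use.datanode.hostname properties in hdfs-site.xml):

          # Sketch only: extra env on the hdfs-data container so datanodes
          # register and are reached by their headless-service hostnames
          # rather than by IP.
          env:
            - name: CORE_CONF_fs_defaultFS
              value: hdfs://hdfs-name:8020
            - name: HDFS_CONF_dfs_datanode_use_datanode_hostname
              value: "true"
            - name: HDFS_CONF_dfs_client_use_datanode_hostname
              value: "true"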

Also, I do not see any HDFS block replication happening, even though the replication factor is 3.

UPDATE
Hi, it turned out to be an Istio proxy issue. I uninstalled Istio and it worked. The Istio proxy was setting the name to 127.0.0.1 instead of the actual IP.

I ran into the same problem, and the workaround I currently use is to disable the Envoy redirection of inbound traffic on port 9000 (8020 in your case) by adding this annotation to the Hadoop namenode:

traffic.sidecar.istio.io/excludeInboundPorts: "9000"
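
For clarity, the annotation belongs on the pod template metadata of the namenode StatefulSet (not on the StatefulSet's own metadata). A sketch against the manifest above, with 8020 as the RPC port in your case:

  #namenode stateful deployment (pod template excerpt, sketch only)
  template:
    metadata:
      labels:
        app.kubernetes.io/name: hdfs-name
        app.kubernetes.io/version: "1.0"
      annotations:
        # keep Envoy from intercepting inbound namenode RPC traffic
        traffic.sidecar.istio.io/excludeInboundPorts: "8020"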

Reference: https://istio.io/v1.4/docs/reference/config/annotations/

After reading through some Istio issues, it seems the source IP is not preserved when traffic is redirected through Envoy.

Related issues:
https://github.com/istio/istio/issues/5679
https://github.com/istio/istio/pull/23275

I have not tried the TPROXY approach yet, because I am not currently running Istio 1.6, which contains the TPROXY source IP preservation fix.
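
If someone does want to try it, TPROXY interception can be requested per pod with an annotation on the same pod template. I have not tested this, so treat it as a sketch:

  template:
    metadata:
      annotations:
        # untested: TPROXY interception is supposed to preserve the
        # original source IP (needs an Istio version with the fix)
        sidecar.istio.io/interceptionMode: TPROXY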
