I have been following https://cloud.google.com/tpu/docs/how-to.
I created a TPU instance and tried to connect to it with the gcloud compute ssh command. Then I got this error.
AppData\Local\Google\Cloud SDK>gcloud compute ssh node-1 --zone=asia-east1-c
ERROR: (gcloud.compute.ssh) Could not fetch resource:
- The resource 'projects/project-masker/zones/asia-east1-c/instances/node-1' was not found
While trying to resolve this error, I found that the TPUs are not included in the execution groups.
AppData\Local\Google\Cloud SDK>gcloud compute tpus list
NAME    ZONE          ACCELERATOR_TYPE  NETWORK  RANGE             STATUS
node-2  asia-east1-c  v2-8              default  10.75.202.248/29  READY
node-1  asia-east1-c  v2-8              default  10.82.81.168/29   READY
AppData\Local\Google\Cloud SDK>gcloud compute tpus execution-groups list
Listed 0 items.
This is the result I got when I tried to restart the TPU.
Request issued for: [node-1]
Waiting for operation [projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b] to complete...done.
done: true
metadata:
  '@type': type.googleapis.com/google.cloud.common.OperationMetadata
  apiVersion: v1
  cancelRequested: false
  createTime: '2021-07-03T08:00:49.884674545Z'
  endTime: '2021-07-03T08:01:31.161199334Z'
  target: projects/project-masker/locations/asia-east1-c/nodes/node-1
  verb: update
name: projects/project-masker/locations/asia-east1-c/operations/operation-1625299249870-5c633787137b9-e14800b7-d997be6b
response:
  '@type': type.googleapis.com/google.cloud.tpu.v1.Node
  acceleratorType: v2-8
  apiVersion: V1
  cidrBlock: 10.82.81.168/29
  createTime: '2021-07-03T07:27:41.148997156Z'
  health: HEALTHY
  ipAddress: 10.82.81.170
  name: projects/project-masker/locations/asia-east1-c/nodes/node-1
  network: global/networks/default
  networkEndpoints:
  - ipAddress: 10.82.81.170
    port: 8470
  port: '8470'
  schedulingConfig: {}
  serviceAccount: service-...@cloud-tpu.iam.gserviceaccount.com
  state: READY
  tensorflowVersion: pytorch-1.9
I tried to find related articles on Google, but could not find any. How can I solve this problem?
You cannot SSH directly into a TPU Node, so gcloud compute ssh {tpu_name} is not expected to work.
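The error in the question is consistent with this: gcloud compute ssh tried to fetch a Compute Engine instance (a path ending in instances/node-1), while the restart output shows the TPU registered as a node (a path ending in nodes/node-1). A minimal Python sketch of that difference, using only the two paths quoted in the question; `resource_kind` is a hypothetical helper written for illustration and is not part of any Google library:

```python
def resource_kind(resource_path: str) -> str:
    """Return the collection segment that precedes the resource name.

    Hypothetical helper for illustration only: it splits the string
    and does not call any Google Cloud API.
    """
    parts = resource_path.strip("/").split("/")
    if len(parts) < 2:
        raise ValueError(f"not a resource path: {resource_path!r}")
    return parts[-2]


# Path that gcloud compute ssh tried to fetch (from the error message).
ssh_target = "projects/project-masker/zones/asia-east1-c/instances/node-1"
# Path under which the TPU actually exists (from the restart output).
tpu_node = "projects/project-masker/locations/asia-east1-c/nodes/node-1"

print(resource_kind(ssh_target))  # instances
print(resource_kind(tpu_node))    # nodes
```

The two names match, but they live in different collections, which is why the instance lookup reports "not found".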
However, you can SSH directly into a TPU VM; see this link. If you are already using a TPU VM, then your problem is that you are running
gcloud compute ssh
instead of
gcloud alpha compute tpus tpu-vm ssh ...
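
To make the choice between the two commands concrete, here is a small Python sketch that only builds the argument list and does not run anything. The function name `build_ssh_command` and its `is_tpu_vm` flag are hypothetical, and passing `--zone` to the tpu-vm command in the same form as in the question's gcloud compute ssh call is an assumption on my part.

```python
from typing import List


def build_ssh_command(name: str, zone: str, is_tpu_vm: bool) -> List[str]:
    """Build (but do not run) the gcloud argv for reaching a TPU over SSH.

    Hypothetical helper: `is_tpu_vm` must be supplied by the caller,
    this function does not detect the TPU architecture by itself.
    """
    if not is_tpu_vm:
        # TPU Nodes expose no SSH endpoint, so there is no command to build.
        raise ValueError(
            f"{name} is a TPU Node; it cannot be reached over SSH directly"
        )
    # Assumption: --zone is passed the same way as for gcloud compute ssh.
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "ssh",
        name, f"--zone={zone}",
    ]


cmd = build_ssh_command("node-1", "asia-east1-c", is_tpu_vm=True)
print(" ".join(cmd))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run(cmd)` without shell quoting concerns, should you want to execute it.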