Google Kubernetes Engine: "The node was low on resource: ephemeral-storage ... which exceeds its request of 0"



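I have a GKE cluster where I create jobs through Django. Each job runs an image of my C++ code, and the image builds are triggered from GitHub. Everything worked fine until now. However, I recently pushed a new commit to GitHub (a very small change, only three or four lines of basic operations), and it built an image as usual. This time, when I tried to create a job, it reported Pod errors: BackoffLimitExceeded, Error with exit code 137, and the job did not complete.

I dug into the problem, and by running kubectl describe pod POD_NAME I got this output from one of the failed pods: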

Conditions:
Type              Status
Initialized       True
Ready             False
ContainersReady   False
PodScheduled      True
Volumes:
kube-api-access-nqgnl:
Type:                    Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds:  3607
ConfigMapName:           kube-root-ca.crt
ConfigMapOptional:       <nil>
DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type     Reason               Age    From               Message
----     ------               ----   ----               -------
Normal   Scheduled            7m32s  default-scheduler  Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
Normal   Pulling              7m7s   kubelet            Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
Normal   Pulled               4m1s   kubelet            Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
Normal   Created              4m1s   kubelet            Created container jobcontainer
Normal   Started              4m     kubelet            Started container jobcontainer
Warning  Evicted              3m29s  kubelet            The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
Normal   Killing              3m29s  kubelet            Stopping container jobcontainer
Warning  ExceededGracePeriod  3m19s  kubelet            Container runtime did not kill the pod within specified grace period.

The error occurs because of the following line:

The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.

I do not have a YAML file that sets my pod information. Instead, I have a Django call that handles the configuration, which looks like this:

import base64
import json
from tempfile import NamedTemporaryFile

import google.auth
import google.auth.transport.requests
import kubernetes
from django.conf import settings
from google.cloud.container_v1 import ClusterManagerClient
from kubernetes import client
from kubernetes.client.rest import ApiException

# id_generator() and get_first_success_build_from_list_builds() are my own
# helpers, defined elsewhere in the project.

def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars={}):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env:
    env_list = []
    for env_name, env_value in env_vars.items():
        env_list.append(client.V1EnvVar(name=env_name, value=env_value))
    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    # Note: no resources are set on the container, so every request is 0
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body

def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name="path/to/cluster")
    # Write the cluster CA certificate to a file the client can read
    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)
    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))
    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
        env_vars={
            "PROJECT": json.dumps(manifest),
            "BUCKET": settings.GS_BUCKET_NAME,
        })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body

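What causes this, and how can I fix it? Should I set the resource requests/limits to some value? If so, how can I do that in my Django job call?

It looks like the node itself is running out of storage. Since your job spec has no request for ephemeral storage, the pod can be scheduled on any node, and in this case that particular node does not seem to have enough storage available.

I'm not a Python expert, but it looks like you should be able to do something like: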
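To put the number in the eviction message in perspective, here is a small sketch that converts the reported usage to bytes. The helper name quantity_to_bytes is made up for illustration, and it only handles plain integer quantities with a handful of suffixes, not the full Kubernetes quantity syntax.

```python
import re

# A subset of the suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
    "": 1,
}

def quantity_to_bytes(quantity: str) -> int:
    """Convert a simple integer quantity such as '91144Ki' to bytes."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _SUFFIXES:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]

used = quantity_to_bytes("91144Ki")   # usage reported in the Evicted event
print(used)                           # 93331456
print(round(used / 1024**2, 1))       # 89.0 (MiB)
```

So the container was using only about 89 MiB when it was evicted. That is a small amount, which fits the reading that the node was short on disk overall and the pod was picked for eviction because its usage exceeded a request of 0.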

storage_size = SOME_VALUE
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
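For reference, this is roughly what the resulting Job would look like as a plain manifest with both a request and a limit for ephemeral storage. It is only a sketch: build_job_manifest is a hypothetical helper, the image name and the 1Gi/2Gi values are placeholders you would need to size for your own workload, and it assumes you pass the dict as the body to create_namespaced_job (as far as I know the Python client accepts plain dicts there, but I have not tested it against your setup).

```python
import json

def build_job_manifest(name, image, storage_request="1Gi", storage_limit="2Gi"):
    """Return a batch/v1 Job manifest (plain dict) with ephemeral-storage set."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "backoffLimit": 0,
            "ttlSecondsAfterFinished": 600,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "jobcontainer",
                        "image": image,
                        "resources": {
                            # The scheduler only places the pod on a node
                            # with at least this much allocatable storage.
                            "requests": {"ephemeral-storage": storage_request},
                            # The pod is evicted if it goes above this.
                            "limits": {"ephemeral-storage": storage_limit},
                        },
                    }],
                },
            },
        },
    }

# Placeholder job and image names, for illustration only.
manifest = build_job_manifest("example-job", "gcr.io/example/image:latest")
resources = manifest["spec"]["template"]["spec"]["containers"][0]["resources"]
print(json.dumps(resources, indent=2))
```

With a non-zero request the scheduler takes the storage into account when picking a node, instead of placing the pod on a node whose disk is already nearly full.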
