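How to properly label and configure Kubernetes to use Nvidia GPUs

I have an on-premises K8s cluster running on bare metal. One of my worker nodes has 4 GPUs, and I want to configure K8s to recognize and use them. Following the official documentation, I installed everything required, and now when I run: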

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

Tue Nov 12 09:20:20 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:02:00.0 Off |                  N/A |
| 29%   25C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:03:00.0 Off |                  N/A |
| 29%   25C    P8     1W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  On   | 00000000:82:00.0 Off |                  N/A |
| 29%   26C    P8     2W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  On   | 00000000:83:00.0 Off |                  N/A |
| 29%   26C    P8    12W / 250W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

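I know I have to label the node so that K8s recognizes these GPUs, but I can't find the correct labels in the official documentation. The docs only show this: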

# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80

In another tutorial (for Google Cloud only), I found the following:

aliyun.accelerator/nvidia_count=1                          #This field is important.
aliyun.accelerator/nvidia_mem=12209MiB
aliyun.accelerator/nvidia_name=Tesla-M40

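So what is the correct way to label my node? Do I also need to label it with the number of GPUs and their memory size?

I see you are trying to make sure your pod gets scheduled on a node that has GPUs.

The simplest way is to label the node that has the GPUs, like this: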

kubectl label node <node_name> has_gpu=true

Then create your pod with a nodeSelector set to has_gpu: true. This way the pod will only be scheduled on nodes that have GPUs. Read more in the k8s docs.
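As a rough illustration of the nodeSelector step, the sketch below builds such a pod manifest in Python and prints it as JSON, which kubectl accepts as well as YAML. The function name gpu_pod_manifest, the pod name gpu-pod and the image are placeholders chosen for this example, not values from the question. Label values are strings, so the selector uses "true" rather than a boolean.

```python
import json


def gpu_pod_manifest(name, image):
    """Build a pod manifest that only schedules onto nodes labeled has_gpu=true."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Label values are strings, so "true" must not be a JSON boolean.
            "nodeSelector": {"has_gpu": "true"},
            "containers": [{"name": name, "image": image}],
        },
    }


if __name__ == "__main__":
    # "gpu-pod" and "nvidia/cuda" are placeholder values for illustration.
    print(json.dumps(gpu_pod_manifest("gpu-pod", "nvidia/cuda"), indent=2))
```

Piping the printed manifest into kubectl apply -f - should create the pod, assuming kubectl is already pointed at the cluster.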

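The only problem with this approach is that the scheduler does not know how many GPUs the node has, so it can schedule more than 4 pods on a node that has only 4 GPUs.

A better option is to use node extended resources.

It looks like this:

Run kubectl proxy

Patch the node resource configuration: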

curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1gpu", "value": "4"}]' \
  http://localhost:8001/api/v1/nodes/<your-node-name>/status
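The ~1 in the patch path is the JSON Pointer (RFC 6901) escape for the / in the resource name example.com/gpu. A minimal Python sketch of how that request body could be built is shown here; the helper names escape_pointer_token and capacity_patch are made up for this example.

```python
import json


def escape_pointer_token(token):
    """Escape one JSON Pointer reference token (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    # '~' is escaped first so the '~' introduced by '~1' is not escaped again.
    return token.replace("~", "~0").replace("/", "~1")


def capacity_patch(resource_name, count):
    """Build a JSON Patch body advertising `count` units of an extended resource."""
    path = "/status/capacity/" + escape_pointer_token(resource_name)
    # The quantity is sent as a string, matching the curl call above.
    return [{"op": "add", "path": path, "value": str(count)}]


if __name__ == "__main__":
    print(json.dumps(capacity_patch("example.com/gpu", 4)))
```

For example.com/gpu and 4 this prints the same JSON document that the curl command passes to --data.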

Assign the extended resource to a pod:

apiVersion: v1
kind: Pod
metadata:
  name: extended-resource-demo
spec:
  containers:
  - name: extended-resource-demo-ctr
    image: my_pod_name
    resources:
      requests:
        example.com/gpu: 1
      limits:
        example.com/gpu: 1

In this case the scheduler knows how many GPUs are available on the node and will not schedule more pods if the request cannot be satisfied.
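To make the accounting concrete, here is a deliberately simplified Python model of what happens when five pods each request one example.com/gpu on a node that advertises 4. It is a toy illustration of the counting only, not the actual kube-scheduler logic, and the function name schedule is invented for this example.

```python
def schedule(capacity, requests):
    """Toy per-node accounting: admit requests in order while capacity remains.

    Returns one boolean per request: True if admitted, False otherwise.
    """
    free = capacity
    admitted = []
    for wanted in requests:
        if wanted <= free:
            free -= wanted
            admitted.append(True)
        else:
            # On a real cluster such a pod would stay Pending.
            admitted.append(False)
    return admitted


if __name__ == "__main__":
    # Five pods, each requesting 1 GPU, on a node advertising 4.
    print(schedule(4, [1, 1, 1, 1, 1]))  # [True, True, True, True, False]
```

With only the has_gpu label there is no such counter, which is why the label-based approach can oversubscribe the node.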
