Using GPUs in k8s
1. Using Device Plugins
Reference: Scheduling GPUs | official Kubernetes documentation
Kubernetes implements Device Plugins to let Pods access special hardware features such as GPUs. As the cluster operator, you install the hardware vendor's GPU driver on each node and run the vendor's corresponding device plugin.
Once these prerequisites are met, Kubernetes exposes amd.com/gpu or nvidia.com/gpu as schedulable resources, and workloads can consume GPU devices by requesting the <vendor>.com/gpu resource. There are, however, some restrictions on how GPU resource requirements can be specified:
- GPUs can only be specified in the limits section, which means:
  - you may not specify requests without also specifying limits;
  - you may specify both limits and requests, but the two values must be equal;
  - you may specify a GPU limit without a request, in which case Kubernetes uses the limit value as the default request.
- Containers (Pods) do not share GPUs, and GPUs cannot be overcommitted.
- Each container can request one or more GPUs, but requesting a fraction of a GPU is not allowed (minimal valid/invalid sketches follow below).
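For illustration, here is what a valid and an invalid resource stanza look like under these rules (minimal sketches, not taken from the official docs):

# Valid: only the limit is set; the request defaults to the limit
resources:
  limits:
    nvidia.com/gpu: 1

# Invalid: a GPU request without a matching limit is rejected
resources:
  requests:
    nvidia.com/gpu: 1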
# each container below requests 2 GPUs (4 total for the Pod)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2
    - name: digits-container
      image: nvcr.io/nvidia/digits:20.12-tensorflow-py3
      resources:
        limits:
          nvidia.com/gpu: 2
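Before applying a manifest like this, you can check how many GPUs a node actually advertises in its capacity (a quick sketch; <your-gpu-node> is a placeholder for your node name):

$ kubectl describe node <your-gpu-node> | grep nvidia.com/gpu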
2. Deploying the AMD GPU Device Plugin
To use AMD GPU resources on a node, you must first install the k8s-device-plugin, and the node must already have the AMD GPU Linux driver installed.
# Install the GPU device plugin
$ kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/r1.10/k8s-ds-amdgpu-dp.yaml
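This manifest deploys the plugin as a DaemonSet; a quick way to confirm its pods are running (pod names will vary, and the kube-system namespace is an assumption based on the upstream manifest):

$ kubectl get pods -n kube-system | grep amdgpu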
3. Deploying the NVIDIA GPU Device Plugin
To use NVIDIA GPU resources on a node, you must first install the k8s-device-plugin, and the following prerequisites must be satisfied beforehand (a quick driver version check follows the list):
- Kubernetes nodes must have the NVIDIA driver pre-installed
- Kubernetes nodes must have nvidia-docker 2.0 pre-installed
- Docker's default runtime must be set to nvidia-container-runtime rather than runc
- The NVIDIA driver version must be 384.81 or later
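You can verify the driver version directly on a node (assuming nvidia-smi is on the PATH once the driver is installed):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader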
# Install nvidia-docker 2.0
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker
# Set nvidia-container-runtime as Docker's default runtime in /etc/docker/daemon.json
$ cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
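After editing daemon.json, restart Docker and confirm the default runtime actually switched (the output should show "Default Runtime: nvidia"):

$ sudo systemctl restart docker
$ docker info | grep -i 'default runtime'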
# Install the GPU device plugin
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
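The manifest deploys the plugin as a DaemonSet in kube-system; a quick status check (pod names will vary):

$ kubectl get pods -n kube-system | grep nvidia-device-plugin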
Installing with Helm
# Alternatively, install with Helm
$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
$ helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin
# Alternatively, run the plugin directly with Docker
$ docker run -it \
--security-opt=no-new-privileges \
--cap-drop=ALL --network=none \
-v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \
nvcr.io/nvidia/k8s-device-plugin:devel
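If the plugin registers with the kubelet successfully, a plugin socket should appear alongside kubelet.sock in the directory mounted above (assuming the default kubelet paths):

$ ls /var/lib/kubelet/device-plugins/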
In short, the device plugin works like this: once one of the resource names below is specified in the Pod spec, the system allocates the requested number of GPUs to the service when the Pod starts, so the program can use GPU resources.
amd.com/gpu or nvidia.com/gpu
Note that a reboot is not required after installing the GPU driver for the first time, but it is required when upgrading the driver later. Even so, it is best to reboot after the first driver installation as well, to guard against unexpected problems.
4. Verification
1) Add a label
root@hello:~# kubectl label nodes 192.168.1.56 nvidia.com/gpu.present=true
root@hello:~# kubectl get nodes -L nvidia.com/gpu.present
NAME           STATUS                     ROLES    AGE    VERSION   GPU.PRESENT
192.168.1.55   Ready,SchedulingDisabled   master   128m   v1.22.2
192.168.1.56   Ready                      node     127m   v1.22.2   true
2) Install Helm and deploy the plugin chart
root@hello:~# curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
root@hello:~# sudo apt-get install apt-transport-https --yes
root@hello:~# echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
root@hello:~# sudo apt-get update
root@hello:~# sudo apt-get install helm
helm install \
--version=0.10.0 \
--generate-name \
nvdp/nvidia-device-plugin
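You can confirm the Helm release was created (the generated release name will differ):

root@hello:~# helm list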
3) Check that the node exposes nvidia resources
root@hello:~# kubectl describe node 192.168.1.56 | grep nv
nvidia.com/gpu.present=true
nvidia.com/gpu: 1
nvidia.com/gpu: 1
kube-system nvidia-device-plugin-1637728448-fgg2d 0 (0%) 0 (0%) 0 (0%) 0 (0%) 50s
nvidia.com/gpu 0 0
root@hello:~#
Download the image
root@hello:~# docker pull registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker save -o tensorflow-gpu.tar registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker load -i tensorflow-gpu.tar
Create a TensorFlow test Pod
root@hello:~# vim gpu-test.yaml
root@hello:~# cat gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
  labels:
    test-gpu: "true"
spec:
  containers:
    - name: training
      image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
      command:
        - python
        - tensorflow-sample-code/tfjob/docker/mnist/main.py
        - --max_steps=300
        - --data_dir=tensorflow-sample-code/data
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - effect: NoSchedule
      operator: Exists
root@hello:~#
root@hello:~# kubectl apply -f gpu-test.yaml
pod/test-gpu created
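Optionally confirm the Pod was scheduled onto the GPU node before reading its logs (output varies per cluster):

root@hello:~# kubectl get pod test-gpu -o wide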
Check the logs
root@hello:~# kubectl logs test-gpu
WARNING:tensorflow:From tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.
See tf.nn.softmax_cross_entropy_with_logits_v2.
2021-11-24 04:38:50.846973: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 04:38:50.847698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:10.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2021-11-24 04:38:50.847759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:00:10.0, compute capability: 7.5)
root@hello:~#
