Deploying Kubernetes to manage GPU servers on Ubuntu 22.04.4

1. Install the driver

1.1 List the available drivers and install one

sudo ubuntu-drivers list --gpgpu

Automatic installation: run sudo ubuntu-drivers autoinstall

Manual installation: run sudo apt install nvidia-driver-×××, replacing ××× with a version from the list above
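
When the list contains several driver series, the version number for the manual install can be picked out mechanically. The following is a minimal sketch, not part of the original procedure: latest_nvidia_driver is a hypothetical helper, and it assumes each line of the list contains a package name of the form nvidia-driver-NNN (possibly with a suffix such as -server).

```shell
# Hypothetical helper: read driver package names on stdin and print the
# highest NNN found in names of the form "nvidia-driver-NNN".
# Assumption: the list output contains such names, one or more per line.
latest_nvidia_driver() {
    grep -oE 'nvidia-driver-[0-9]+' | sed 's/nvidia-driver-//' | sort -n | tail -n 1
}

# Example usage on a GPU host (not run here):
#   ver=$(sudo ubuntu-drivers list --gpgpu | latest_nvidia_driver)
#   sudo apt install "nvidia-driver-${ver}"
```

The helper only prints the series number; whether to install the plain or the -server flavor of that series is still a manual decision.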

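1.2 Verify the installation

You must reboot the system with sudo reboot.
After the reboot, run nvidia-smi. If it prints the GPU information (as shown below), the driver was installed successfully.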

user@k8s-train-1:~$ sudo nvidia-smi 
[sudo] password for user: 
Wed Apr 29 14:50:14 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.18             Driver Version: 580.126.18     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:16:00.0 Off |                    0 |
|  0%   33C    P8             10W /  150W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10                     Off |   00000000:34:00.0 Off |                    0 |
|  0%   34C    P8             10W /  150W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10                     Off |   00000000:AC:00.0 Off |                  Off |
|  0%   33C    P8             10W /  150W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10                     Off |   00000000:CA:00.0 Off |                  Off |
|  0%   32C    P8             10W /  150W |       0MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
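
For scripted checks, the same information can be read in CSV form with nvidia-smi --query-gpu instead of the table above. The sketch below is an illustration only: gpu_count and check_gpu_count are hypothetical helpers, and the expected count of 4 simply matches the four A10 cards shown in this output.

```shell
# Count GPUs from CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
# Every non-blank line is treated as one GPU.
gpu_count() {
    grep -c '[^[:space:]]'
}

# Compare the number of GPUs on stdin with an expected value.
# Prints a short status line and returns non-zero on mismatch.
check_gpu_count() {
    expected=$1
    actual=$(gpu_count)
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: found $actual GPU(s)"
    else
        echo "MISMATCH: expected $expected GPU(s), found $actual" >&2
        return 1
    fi
}

# Example usage on a GPU host (not run here):
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader | check_gpu_count 4
```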

1.3 Check whether NVLink is supported

user@k8s-train-1:~$ sudo nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    0-119   0               N/A
GPU1    NODE     X      NODE    NODE    0-119   0               N/A
GPU2    NODE    NODE     X      NODE    0-119   0               N/A
GPU3    NODE    NODE    NODE     X      0-119   0   
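
In this matrix every GPU pair is reported as NODE, which means the traffic goes over PCIe within a NUMA node; an NVLink connection would instead appear as NV followed by the number of links (for example NV4). So these cards are not connected by NVLink. The check can be automated with a small filter; has_nvlink below is a hypothetical helper written for this illustration, not an nvidia-smi feature.

```shell
# Succeed (exit 0) if any GPU row of "nvidia-smi topo -m" output on stdin
# contains an NVLink entry such as NV1, NV2, NV4; fail (exit 1) otherwise.
# Only rows that start with GPU<n> are inspected, so the legend is ignored.
has_nvlink() {
    awk '/^GPU[0-9]+/ {
             for (i = 2; i <= NF; i++)
                 if ($i ~ /^NV[0-9]+$/) found = 1
         }
         END { exit(found ? 0 : 1) }'
}

# Example usage on a GPU host (not run here):
#   if sudo nvidia-smi topo -m | has_nvlink; then
#       echo "NVLink detected"
#   else
#       echo "no NVLink, GPUs communicate over PCIe"
#   fi
```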

2. Install the CUDA Toolkit

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-13-0

Verify the installation:
root@iZbp15lv2der847tlwkkd3Z:~# nvcc -V
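
If nvcc is reported as not found, the toolkit's bin directory is probably not on PATH, since the apt packages do not add it automatically. A minimal sketch of the fix follows, assuming the toolkit is reachable through the usual /usr/local/cuda symlink; prepend_path is a hypothetical helper, and the location should be adjusted if your installation differs.

```shell
# Assumption: the toolkit lives under /usr/local/cuda (the usual symlink
# to the versioned directory). Override CUDA_HOME if yours differs.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"

# Prepend a directory to PATH only if it is not already listed,
# so sourcing this snippet repeatedly does not grow PATH.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path "$CUDA_HOME/bin"
export PATH CUDA_HOME

# Afterwards (on a host with the toolkit installed):
#   nvcc -V
```

Placing these lines in ~/.bashrc or a file under /etc/profile.d/ makes the setting persistent across logins.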

3. GPU Docker environment

1. Install the prerequisites for the instructions below:

sudo apt-get update && sudo apt-get install -y --no-install-recommends \
   ca-certificates \
   curl \
   gnupg2
   
2. Configure the production repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

3. Optionally, configure the repository to use experimental packages:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update the package list from the repository:

sudo apt-get update

4. Install the NVIDIA Container Toolkit packages:

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

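5. Restart containerd and check whether the runtime has been set correctly. The bootstrap.sh command comes from the referenced Amazon EKS guide and applies only to EKS nodes; /etc/eks/bootstrap.sh does not exist on a self-managed cluster:
   sudo systemctl restart containerd
   sudo /etc/eks/bootstrap.sh ${YOUR_CLUSTER_NAME} --container-runtime nvidia-container-runtime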
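
On nodes that are not part of EKS, one common way to register the NVIDIA runtime with containerd is the nvidia-ctk tool that ships with the NVIDIA Container Toolkit. The sketch below shows those commands as comments and adds a small check; has_nvidia_runtime is a hypothetical helper, and the default config path is an assumption, since some toolkit versions write a drop-in file elsewhere.

```shell
# Commands to run on each GPU node (left as comments because they need root
# and a real containerd installation):
#   sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
#   sudo systemctl restart containerd

# Hypothetical check: succeed if a containerd config file declares a
# runtime named "nvidia".
# Assumption: the config is at /etc/containerd/config.toml unless a path
# is passed as the first argument.
has_nvidia_runtime() {
    config=${1:-/etc/containerd/config.toml}
    [ -r "$config" ] && grep -q 'containerd\.runtimes\.nvidia' "$config"
}

# Example usage (not run here):
#   has_nvidia_runtime && echo "nvidia runtime is configured"
```

Setting the NVIDIA runtime as the default is one design choice; the alternative is to leave the default unchanged and select the runtime per pod through a RuntimeClass.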

K8s: install the device plugin
1. Deploy the device plugin DaemonSet from your local machine (kubectl create -f and kubectl apply -f deploy the same manifest here, so run only one of them):
    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml
2. Verify that there are allocatable GPUs:
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
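
Once the nodes report allocatable GPUs, a quick end-to-end test is to schedule a pod that requests one GPU and runs nvidia-smi. The following is a minimal sketch under stated assumptions: the pod name gpu-smoke-test, the helper write_gpu_test_pod, and the image tag are all chosen for illustration, and any CUDA base image you can pull will do.

```shell
# Write a minimal test Pod manifest to the file given as $1.
# Assumptions: the pod name and the image tag are illustrative only.
write_gpu_test_pod() {
    cat > "$1" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
}

# Example usage against a running cluster (not run here):
#   write_gpu_test_pod gpu-smoke-test.yaml
#   kubectl apply -f gpu-smoke-test.yaml
#   kubectl logs gpu-smoke-test
#   kubectl delete pod gpu-smoke-test
```

If the pod stays Pending, the scheduler found no node with a free nvidia.com/gpu resource, which usually points back to the device plugin or the container runtime configuration.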

4. References:

https://documentation.ubuntu.com/aws/aws-how-to/kubernetes/enable-gpus-on-eks/#install-a-gpu-driver-on-each-node

 

posted @ 2026-05-15 16:03  hopeccie