Akemi

Deploying nvidia-device-plugin with Helm

2025/06/30

I previously looked at deploying this with gpu-operator, but that approach is too complex and pulls in a pile of components. The more components there are, the more things can go wrong; I'll dig into it when I have time.

Here I deploy directly by following the official documentation.

Prerequisites

The NVIDIA driver must already be installed (steps omitted here); verify with:
nvidia-smi

Install nvidia-container-toolkit:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Optional: enable the experimental repo
sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
nvidia-ctk --version

Configure the container runtime (both Docker and containerd need it)
Configure the Docker runtime:
sudo nvidia-ctk runtime configure --runtime=docker
sudo vim /etc/docker/daemon.json
"runtimes": {
  "nvidia": {
    "args": [],
    "path": "nvidia-container-runtime"
  }
}
sudo systemctl daemon-reload
sudo systemctl restart docker

Configure the containerd runtime:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo vim /etc/containerd/config.toml
...
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
...

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
      CriuImagePath = ""
      CriuPath = ""
      CriuWorkPath = ""
      IoGid = 0
      IoUid = 0
      NoNewKeyring = false
      NoPivotRoot = false
      Root = ""
      ShimCgroup = ""
      SystemdCgroup = true
...
sudo systemctl restart containerd
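Before moving on to Kubernetes, it can be worth smoke-testing the runtime directly. A minimal check (reusing the CUDA image from the test pod later in this post; any CUDA base image should work):

```shell
# Run nvidia-smi inside a container through the nvidia runtime;
# it should print the same GPU table as nvidia-smi on the host.
sudo docker run --rm --runtime=nvidia --gpus all \
  nvcr.io/nvidia/cuda:12.0.1-runtime-ubuntu20.04 nvidia-smi
```

If this fails while `nvidia-smi` works on the host, the problem is in the runtime wiring rather than in Kubernetes.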

Plugin deployment

Reference:

https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#deployment-via-helm

# Required images
registry.k8s.io/nfd/node-feature-discovery:v0.15.3 (needs a proxy to pull)
nvcr.io/nvidia/k8s-device-plugin:v0.17.1

# Add the Helm repo
kubectl create ns nvidia-device-plugin
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

helm pull nvdp/nvidia-device-plugin --version=0.17.1 --untar
cd nvidia-device-plugin/

# Edit parameters
vim values.yaml
Enable the gfd option (nfd is enabled by default):
gfd:
  enabled: true
...

vim charts/node-feature-discovery/values.yaml
Adjust the image configuration if needed; the defaults are shown below, with default tag v0.15.3:
image:
  repository: registry.k8s.io/nfd/node-feature-discovery
  # This should be set to 'IfNotPresent' for released version
  pullPolicy: IfNotPresent


# Deploy
helm install nvidia-device-plugin ./ -f values.yaml -n nvidia-device-plugin

helm list -n nvidia-device-plugin
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
nvidia-device-plugin nvidia-device-plugin 11 2025-06-30 16:11:59.267278186 +0800 CST deployed nvidia-device-plugin-0.17.1 0.17.1

kubectl get pods -n nvidia-device-plugin
NAME READY STATUS RESTARTS AGE
nvidia-device-plugin-gpu-feature-discovery-6tzsp 1/1 Running 0 52m
nvidia-device-plugin-node-feature-discovery-gc-6d6b9d45dd-gbpm8 1/1 Running 0 91m
nvidia-device-plugin-node-feature-discovery-master-889bff7gd2xz 1/1 Running 0 106m
nvidia-device-plugin-node-feature-discovery-worker-6nzz2 1/1 Running 0 91m
nvidia-device-plugin-xfk66 1/1 Running 0 52m

kubectl describe node | grep gpu
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=9
nvidia.com/gpu.count=8
nvidia.com/gpu.family=ada-lovelace
nvidia.com/gpu.machine=Rack-Server
nvidia.com/gpu.memory=24564
nvidia.com/gpu.mode=graphics
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-4090
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/vgpu.present=false
nvidia.com/gpu: 8
nvidia.com/gpu: 8
nvidia-device-plugin nvidia-device-plugin-gpu-feature-discovery-6tzsp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 81m
nvidia.com/gpu 0 0
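The `gpu.sharing-strategy=none` and `gpu.replicas=1` labels above mean no GPU sharing is configured, so each physical GPU maps to exactly one `nvidia.com/gpu` resource. If oversubscription is wanted later, the device plugin supports time-slicing through a config file passed via the chart's `config` values; a sketch of that config (the replica count of 4 is just an example):

```yaml
# Example device-plugin config enabling time-slicing:
# each physical GPU is advertised as 4 schedulable nvidia.com/gpu resources.
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
```

With this applied, `nvidia.com/gpu.replicas` on the node labels would become 4 and `gpu.sharing-strategy` would switch to `time-slicing`.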

Resource test

cat test-gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/cuda:12.0.1-runtime-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f test-gpu-pod.yaml

kubectl get pods
NAME READY STATUS RESTARTS AGE
gpu-test-pod 0/1 Completed 0 2m50s
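Since the pod just runs `nvidia-smi` and exits, the Completed status alone doesn't prove the GPU was visible inside the container; the pod log should show the familiar nvidia-smi table:

```shell
# The log should contain the nvidia-smi output from inside the container
kubectl logs gpu-test-pod

# Clean up the test pod afterwards
kubectl delete -f test-gpu-pod.yaml
```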