Fixing a version mismatch between nvidia-fabricmanager.service and the NVIDIA driver

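nvidia-fabricmanager.service failed periodically. Investigation showed that the previously installed fabricmanager package was being reset at intervals, which left it at a version that no longer matched the driver. The fix is to manually uninstall the wrong version and install the fabricmanager version that corresponds to the installed driver.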

1. Inter-GPU communication is abnormal; the cause is that nvidia-fabricmanager.service is in a failed state

nvidia-smi topo -p2p n

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7

GPU0 X NS NS NS NS NS NS NS
GPU1 NS X NS NS NS NS NS NS
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown

systemctl status nvidia-fabricmanager.service

× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sat 2025-06-21 06:50:16 CST; 8h ago
Duration: 1d 7h 36min 25.870s
Process: 15866 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
CPU: 16ms

Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.

journalctl -u nvidia-fabricmanager.service

Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Stopping nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Deactivated successfully.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Stopped nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Consumed 26.895s CPU time, 7.4M memory peak, 0B memory swap peak.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:10:01 ai-uat-master-1 nv-fabricmanager[37889]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:10:01 ai-uat-master-1 nv-fabricmanager[37889]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:46:36 ai-uat-master-1 nv-fabricmanager[29176]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:46:36 ai-uat-master-1 nv-fabricmanager[29176]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 19 23:13:49 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 19 23:13:50 ai-uat-master-1 nv-fabricmanager[65074]: Connected to 1 node.
Jun 19 23:13:50 ai-uat-master-1 nv-fabricmanager[65074]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with t>
Jun 19 23:13:50 ai-uat-master-1 systemd[1]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Stopping nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Deactivated successfully.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Stopped nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Consumed 1min 2.079s CPU time, 7.7M memory peak, 0B memory swap peak.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
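
The mismatch message repeats many times in the journal, so it can help to reduce it to the distinct version pairs. The following is a minimal sketch that assumes the message wording shown in the logs above; the function name `fm_mismatch_versions` is hypothetical and only used for illustration.

```shell
#!/usr/bin/env bash
# Read journal text on stdin and print each distinct pair
# "<fabric manager version> <driver version>" found in mismatch messages.
# The pattern assumes the exact wording seen in the logs above.
fm_mismatch_versions() {
  sed -n 's/.*driver interface version \([0-9.]*\) don.t match with driver version \([0-9.]*\)\..*/\1 \2/p' \
    | sort -u
}

# Typical use on the affected host:
#   journalctl -u nvidia-fabricmanager.service | fm_mismatch_versions
```

On the logs above this would reduce every mismatch line to a single pair: the fabricmanager version first, the driver version second.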
2. Identify the mismatched version, uninstall it, and install the matching version

dpkg -l |grep nvidia-fabricmanager

ii nvidia-fabricmanager-550 550.163.01-0ubuntu0.24.04.1 amd64 Fabric Manager for NVSwitch based systems.

dpkg -P nvidia-fabricmanager-550

(Reading database ... 126631 files and directories currently installed.)
Purging configuration files for nvidia-fabricmanager-550 (550.163.01-0ubuntu0.24.04.1) ...
dpkg: warning: while removing nvidia-fabricmanager-550, directory '/usr/share/nvidia' not empty so not removed

dpkg -l |grep nvidia-fabricmanager

nvidia-smi --version

NVIDIA-SMI version : 550.144.03
NVML version : 550.144
DRIVER version : 550.144.03
CUDA Version : 12.4
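
The comparison between the package version and the driver version can also be scripted. This is a sketch under the assumption that the fabricmanager package version is the driver version followed by a Debian revision suffix, as in the two package versions shown in this post; `upstream_version` and `fm_matches_driver` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Strip an optional epoch ("1:") and the Debian revision ("-1",
# "-0ubuntu0.24.04.1") so that only the upstream version remains.
upstream_version() {
  local v="${1#*:}"
  printf '%s\n' "${v%%-*}"
}

# Succeeds when the package version ($1) corresponds to the driver version ($2).
fm_matches_driver() {
  [ "$(upstream_version "$1")" = "$2" ]
}

# Typical use on the host (package name taken from the dpkg output above):
#   pkg_ver=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)
#   drv_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm_matches_driver "$pkg_ver" "$drv_ver" || echo "fabricmanager/driver mismatch"
```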

dpkg -i nvidia-fabricmanager-550_550.144.03-1_amd64.deb

Selecting previously unselected package nvidia-fabricmanager-550.
(Reading database ... 126630 files and directories currently installed.)
Preparing to unpack nvidia-fabricmanager-550_550.144.03-1_amd64.deb ...
Unpacking nvidia-fabricmanager-550 (550.144.03-1) ...
Setting up nvidia-fabricmanager-550 (550.144.03-1) ...
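
Because the package was reset periodically, it may be worth keeping the package manager from replacing the freshly installed version. The post does not say what caused the resets; the sketch below assumes they came from automatic package upgrades and uses `apt-mark hold` to pin the package. The `run` wrapper and `hold_fabricmanager` are hypothetical helpers, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
PKG="nvidia-fabricmanager-550"   # package name as shown by dpkg -l above

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Mark the package as held so upgrades skip it, then list held packages.
hold_fabricmanager() {
  run apt-mark hold "$PKG"
  run apt-mark showhold
}

# Typical use (as root):
#   DRY_RUN=0 hold_fabricmanager
```

The hold has to be released with `apt-mark unhold` before a planned driver upgrade, so that fabricmanager and the driver can be moved to the new version together.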
3. Start the nvidia-fabricmanager that matches the driver version

systemctl enable nvidia-fabricmanager

Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /usr/lib/systemd/system/nvidia-fabricmanager.service.

systemctl start nvidia-fabricmanager

dpkg -l |grep nvidia-fabricmanager

ii nvidia-fabricmanager-550 550.144.03-1 amd64 Fabric Manager for NVSwitch based systems.

systemctl status nvidia-fabricmanager

● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: active (running) since Sat 2025-06-21 14:59:38 CST; 4min 44s ago
Process: 46938 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
Main PID: 46940 (nv-fabricmanage)
Tasks: 18 (limit: 9830)
Memory: 4.1M (peak: 8.1M)
CPU: 449ms
CGroup: /system.slice/nvidia-fabricmanager.service
└─46940 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Jun 21 14:59:37 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 14:59:38 ai-uat-master-1 nv-fabricmanager[46940]: Connected to 1 node.
Jun 21 14:59:38 ai-uat-master-1 nv-fabricmanager[46940]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with t>
Jun 21 14:59:38 ai-uat-master-1 systemd[1]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.

nvidia-smi topo -p2p n

    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7

GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
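
The final check can be automated by parsing the matrix. This is a minimal sketch that assumes the layout shown above (a header row of GPU names, then one row per GPU whose cells are `X`, `OK`, or one of the legend codes); `p2p_all_ok` is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Read `nvidia-smi topo -p2p n` output on stdin.
# Exit 0 only if at least one GPU row exists and every cell is X or OK.
p2p_all_ok() {
  awk '
    # A data row starts with a GPU name and its second field is a status,
    # which distinguishes it from the header row (all fields are GPU names).
    $1 ~ /^GPU[0-9]+$/ && NF >= 2 && $2 !~ /^GPU[0-9]+$/ {
      rows++
      for (i = 2; i <= NF; i++)
        if ($i != "X" && $i != "OK") bad++
    }
    END { exit ((rows > 0 && bad == 0) ? 0 : 1) }
  '
}

# Typical use on the host:
#   nvidia-smi topo -p2p n | p2p_all_ok && echo "P2P OK" || echo "P2P not fully available"
```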

posted on 2025-06-21 15:34  喵++喵