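Fixing a version mismatch between nvidia-fabricmanager.service and the NVIDIA driver
nvidia-fabricmanager.service fails periodically. Investigation shows that the previously installed Fabric Manager package is periodically replaced with a different version. The fix is to manually uninstall the wrong version and install the Fabric Manager version that matches the driver.
1. Checking the system shows that GPU-to-GPU (peer-to-peer) communication is not working; the cause is that nvidia-fabricmanager.service is in a failed state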
nvidia-smi topo -p2p n
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NS NS NS NS NS NS NS
GPU1 NS X NS NS NS NS NS NS
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sat 2025-06-21 06:50:16 CST; 8h ago
Duration: 1d 7h 36min 25.870s
Process: 15866 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
CPU: 16ms
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
journalctl -u nvidia-fabricmanager.service
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Stopping nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Deactivated successfully.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Stopped nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Consumed 26.895s CPU time, 7.4M memory peak, 0B memory swap peak.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:10:01 ai-uat-master-1 nv-fabricmanager[37889]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:10:01 ai-uat-master-1 nv-fabricmanager[37889]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 20 06:10:01 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 20 06:46:36 ai-uat-master-1 nv-fabricmanager[29176]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:46:36 ai-uat-master-1 nv-fabricmanager[29176]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 20 06:46:36 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 19 23:13:49 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 19 23:13:50 ai-uat-master-1 nv-fabricmanager[65074]: Connected to 1 node.
Jun 19 23:13:50 ai-uat-master-1 nv-fabricmanager[65074]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with t>
Jun 19 23:13:50 ai-uat-master-1 systemd[1]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Stopping nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Deactivated successfully.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Stopped nvidia-fabricmanager.service - NVIDIA fabric manager service.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Consumed 1min 2.079s CPU time, 7.7M memory peak, 0B memory swap peak.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 nv-fabricmanager[15868]: fabric manager NVIDIA GPU driver interface version 550.163.01 don't match with driver version 550.144.03. Please update with matching NVIDIA driver package.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jun 21 06:50:16 ai-uat-master-1 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.
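The mismatch message in the log above already names both versions, so they can be pulled out mechanically. The following is a minimal sketch, not part of any NVIDIA tooling: `parse_mismatch` is a hypothetical helper name, and it assumes the message wording is exactly as shown in the journal output above.

```shell
# parse_mismatch (hypothetical helper): reads log text on stdin and prints
# "<fabric manager version> <driver version>" for the first mismatch line.
# Assumes the message format seen in the journal output above.
parse_mismatch() {
    sed -n -E \
        's/.*interface version ([0-9]+(\.[0-9]+)+) don.t match with driver version ([0-9]+(\.[0-9]+)+).*/\1 \3/p' \
        | head -n 1
}

# Usage on the affected host:
#   journalctl -u nvidia-fabricmanager.service --no-pager | parse_mismatch
```

If nothing is printed, the service log contains no mismatch line in this format and the failure probably has another cause.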
2. Confirm the version mismatch, remove the mismatched package, and install the version that matches the driver
dpkg -l |grep nvidia-fabricmanager
ii nvidia-fabricmanager-550 550.163.01-0ubuntu0.24.04.1 amd64 Fabric Manager for NVSwitch based systems.
dpkg -P nvidia-fabricmanager-550
(Reading database ... 126631 files and directories currently installed.)
Purging configuration files for nvidia-fabricmanager-550 (550.163.01-0ubuntu0.24.04.1) ...
dpkg: warning: while removing nvidia-fabricmanager-550, directory '/usr/share/nvidia' not empty so not removed
dpkg -l |grep nvidia-fabricmanager
nvidia-smi --version
NVIDIA-SMI version : 550.144.03
NVML version : 550.144
DRIVER version : 550.144.03
CUDA Version : 12.4
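The comparison between the package version and the driver version can also be scripted. This is a sketch under stated assumptions: `upstream_version` and `versions_match` are hypothetical helper names, and the package name `nvidia-fabricmanager-550` is taken from the `dpkg -l` output above. The helper strips the Debian revision suffix so that `550.144.03-1` compares equal to driver `550.144.03`.

```shell
# upstream_version (hypothetical helper): strips a Debian epoch ("1:") and
# revision suffix ("-0ubuntu0.24.04.1") from a package version string.
upstream_version() {
    v=${1#*:}                  # drop the epoch, if any
    printf '%s\n' "${v%%-*}"   # drop everything from the first hyphen
}

# versions_match (hypothetical helper): succeeds when the package's upstream
# version equals the driver version.
versions_match() {
    [ "$(upstream_version "$1")" = "$2" ]
}

# Usage on the affected host:
#   pkg=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-550)
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   versions_match "$pkg" "$drv" || echo "mismatch: package $pkg, driver $drv"
```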
dpkg -i nvidia-fabricmanager-550_550.144.03-1_amd64.deb
Selecting previously unselected package nvidia-fabricmanager-550.
(Reading database ... 126630 files and directories currently installed.)
Preparing to unpack nvidia-fabricmanager-550_550.144.03-1_amd64.deb ...
Unpacking nvidia-fabricmanager-550 (550.144.03-1) ...
Setting up nvidia-fabricmanager-550 (550.144.03-1) ...
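Because the package was being replaced periodically, it may help to pin it after installing the matching version. The source does not say what replaced it; one possible explanation, which is an assumption here, is an automatic upgrade (for example unattended-upgrades) pulling 550.163.01 from the Ubuntu archive. Searching `/var/log/apt/history.log` for `fabricmanager` is one way to check that. Under that assumption, `apt-mark hold` stops apt from upgrading the package. `is_held` below is a hypothetical helper for checking the result.

```shell
# is_held (hypothetical helper): reads `apt-mark showhold` output on stdin
# and succeeds if the exact package name given as $1 is listed.
is_held() {
    grep -qxF -- "$1"
}

# Usage on the affected host (as root):
#   apt-mark hold nvidia-fabricmanager-550
#   apt-mark showhold | is_held nvidia-fabricmanager-550 && echo "held"
# Release the hold later with:
#   apt-mark unhold nvidia-fabricmanager-550
```

A hold has to be lifted deliberately when the driver itself is upgraded, so that Fabric Manager and the driver move to the new version together.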
3. Enable and start the nvidia-fabricmanager version that matches the driver
systemctl enable nvidia-fabricmanager
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /usr/lib/systemd/system/nvidia-fabricmanager.service.
systemctl start nvidia-fabricmanager
dpkg -l |grep nvidia-fabricmanager
ii nvidia-fabricmanager-550 550.144.03-1 amd64 Fabric Manager for NVSwitch based systems.
systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: active (running) since Sat 2025-06-21 14:59:38 CST; 4min 44s ago
Process: 46938 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=0/SUCCESS)
Main PID: 46940 (nv-fabricmanage)
Tasks: 18 (limit: 9830)
Memory: 4.1M (peak: 8.1M)
CPU: 449ms
CGroup: /system.slice/nvidia-fabricmanager.service
└─46940 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Jun 21 14:59:37 ai-uat-master-1 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jun 21 14:59:38 ai-uat-master-1 nv-fabricmanager[46940]: Connected to 1 node.
Jun 21 14:59:38 ai-uat-master-1 nv-fabricmanager[46940]: Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with t>
Jun 21 14:59:38 ai-uat-master-1 systemd[1]: Started nvidia-fabricmanager.service - NVIDIA fabric manager service.
nvidia-smi topo -p2p n
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X OK OK OK OK OK OK OK
GPU1 OK X OK OK OK OK OK OK
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
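The final check can be automated by scanning the peer-to-peer matrix for any cell other than `X` or `OK`. This is a minimal sketch: `p2p_all_ok` is a hypothetical helper name, and it assumes the matrix layout shown above, where data rows start with a `GPU<n>` label followed by status cells.

```shell
# p2p_all_ok (hypothetical helper): reads `nvidia-smi topo -p2p n` output on
# stdin; succeeds only if at least one GPU row was seen and every status cell
# is "X" or "OK".
p2p_all_ok() {
    awk '
        BEGIN { rows = 0; bad = 0 }
        # Data rows start with a GPU label; the header row is skipped because
        # its second field is also a GPU label.
        $1 ~ /^GPU[0-9]+$/ && $2 !~ /^GPU[0-9]+$/ {
            rows++
            for (i = 2; i <= NF; i++)
                if ($i != "X" && $i != "OK") bad++
        }
        END { exit (rows > 0 && bad == 0) ? 0 : 1 }
    '
}

# Usage on the affected host:
#   nvidia-smi topo -p2p n | p2p_all_ok && echo "P2P OK" || echo "P2P degraded"
```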