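[Undergraduate Project Training] NVIDIA GPU Memory Anomalies and How to Handle Them
Symptom
While running a program, I hit the error torch.cuda.OutOfMemoryError: CUDA out of memory.
Because the model is far smaller than the memory of the GPU in use, I checked the GPU state with: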
$ nvidia-smi
# or refresh automatically every two seconds, highlighting changes
$ watch -n 2 -d nvidia-smi
The output showed high memory usage but low GPU utilization (12120 MiB in use at 0% utilization, with no process listed):
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:12:00.0 Off |                  N/A |
| 38%   28C    P8             20W /  350W |   12120MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
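I suspected that a program which had been closed unexpectedly was still holding GPU memory (in effect, a memory leak), so I decided to free the memory manually.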
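The "high memory, low utilization" check can also be scripted instead of read off the table by eye. The following is only a sketch: flag_idle_gpus is a hypothetical helper name, and the thresholds (at least 1024 MiB in use, at most 5% utilization) are assumptions chosen for illustration. It expects the CSV lines that nvidia-smi prints with --query-gpu=index,memory.used,memory.total,utilization.gpu and --format=csv,noheader,nounits.

```shell
# Flag GPUs that hold a lot of memory while doing almost no work.
# stdin: one line per GPU, "index, memory.used, memory.total, utilization.gpu"
# (MiB and percent, without units). Thresholds are illustrative assumptions.
flag_idle_gpus() {
    awk -F', *' -v min_used=1024 -v max_util=5 '
        $2 >= min_used && $4 <= max_util {
            printf "GPU %s: %d MiB of %d MiB in use at %d%% utilization\n", $1, $2, $3, $4
        }'
}
```

On a machine with the NVIDIA driver installed, it could be used as follows:
$ nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits | flag_idle_gpus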
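Solution
Use the fuser tool to find the processes that hold the GPU device files open. If it is not installed yet, install it with:
# Ubuntu 20.04
$ apt-get install psmisc
Then query the devices: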
$ fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia5:        root     kernel mount /dev/nvidia5
                     root      47329 F...m pt_main_thread
/dev/nvidiactl:      root     kernel mount /dev/nvidiactl
                     root      47329 F...m pt_main_thread
/dev/nvidia-uvm:     root     kernel mount /dev/nvidia-uvm
                     root      47329 F...m pt_main_thread
/dev/nvidia-uvm-tools:
                     root     kernel mount /dev/nvidia-uvm-tools
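Before killing anything, it helps to reduce the listing above to its unique PIDs and look at each process. A minimal sketch, assuming the listing format shown above; gpu_pids is a hypothetical helper name. It keeps only purely numeric fields, so the "kernel" entries and the device paths are skipped.

```shell
# Print the unique numeric PIDs found in `fuser -v`-style text on stdin.
# Only fields made up entirely of digits are kept, which drops "kernel",
# user names, access flags, and device paths such as /dev/nvidia5.
gpu_pids() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) print $i }' | sort -un
}
```

Since fuser writes everything except the PIDs to stderr, the sketch merges both streams first. Each PID can then be inspected with ps:
$ for pid in $(fuser -v /dev/nvidia* 2>&1 | gpu_pids); do ps -o pid,user,etime,args -p "$pid"; done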
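Kill the corresponding process with kill -9 <PID> (here, PID 47329).
Running nvidia-smi again shows that the memory has been released: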
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:12:00.0 Off |                  N/A |
| 38%   28C    P8             20W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
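The steps above can be combined into a single command. Note that it kills every process that has a /dev/nvidia* device open, including jobs you may want to keep, so check the fuser output first:
$ fuser -v /dev/nvidia* | awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh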
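kill -9 sends SIGKILL, which the target process cannot catch, so it gets no chance to shut down cleanly. A more cautious variant is to send SIGTERM first and escalate only if the process is still alive. The following is a sketch, not a drop-in replacement: stop_pid is a hypothetical helper name and the 5-second grace period is an assumption.

```shell
# Stop one process gracefully: SIGTERM first, SIGKILL only as a last resort.
# stop_pid is a hypothetical helper; the 5-second grace period is an assumption.
stop_pid() {
    pid=$1
    # If SIGTERM cannot be delivered, the process is already gone
    # or we lack permission to signal it; nothing more to do here.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt 5 ]; do
        # kill -0 sends no signal; it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Still alive after the grace period: force it, as kill -9 would.
    kill -KILL "$pid" 2>/dev/null
}
```

For the process found above, this would be called as:
$ stop_pid 47329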