[Undergraduate Project Practicum] NVIDIA GPU Memory Anomalies and Handling

Symptoms

While running a program, torch.cuda.OutOfMemoryError: CUDA out of memory. was raised, even though the model is far smaller than the GPU's total memory. Check GPU status with:

$ nvidia-smi
# or refresh automatically every two seconds
$ watch -n 2 -d nvidia-smi

The output shows high memory usage but low GPU utilization:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:12:00.0 Off |                  N/A |
| 38%   28C    P8             20W /  350W |   12120MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

The likely cause is a program that exited abnormally but is still holding GPU memory, so the plan is to release that memory manually.
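
As a quick cross-check, nvidia-smi can also query the driver's compute-process list directly. Here it returns no rows even though roughly 12 GiB are in use, confirming that no visible process accounts for the memory:

# list compute processes known to the driver; empty in this case,
# matching the empty Processes table above
$ nvidia-smi --query-compute-apps=pid,used_memory --format=csv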

Solution

Use the fuser tool to find the processes holding the GPU. If the command is not installed, install it with:

# Ubuntu 20.04
$ sudo apt-get install psmisc

Then query which processes are holding the NVIDIA device files:

$ fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia5:        root     kernel mount /dev/nvidia5
                     root      47329 F...m pt_main_thread
/dev/nvidiactl:      root     kernel mount /dev/nvidiactl
                     root      47329 F...m pt_main_thread
/dev/nvidia-uvm:     root     kernel mount /dev/nvidia-uvm
                     root      47329 F...m pt_main_thread
/dev/nvidia-uvm-tools:
                     root     kernel mount /dev/nvidia-uvm-tools

Kill the corresponding process with kill -9 <pid>.
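
For example, using the PID reported by fuser above (47329):

$ kill -9 47329

After the kill, nvidia-smi shows the memory released: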

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:12:00.0 Off |                  N/A |
| 38%   28C    P8             20W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The whole procedure can be done in one line. fuser -v writes its human-readable table to stderr and only the PIDs to stdout, so the pipeline feeds just the PIDs to awk, which emits a kill -9 command for each:

$ fuser -v /dev/nvidia* | awk '{for(i=1;i<=NF;i++) print "kill -9 " $i}' | sh
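
Alternatively, fuser can deliver the signal itself: its -k option kills every process accessing the given files, sending SIGKILL by default. Note that either variant kills everything holding an NVIDIA device file, so make sure no legitimate jobs are running on the GPU first:

$ fuser -k /dev/nvidia*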

