Errors when training a PyTorch model under nohup, and basic tmux usage

Problem:

When training a PyTorch model in the background with the nohup command, closing the SSH window sometimes produces the following error:

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156332 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4156333 closing signal SIGHUP
Traceback (most recent call last):
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
)(cmd_args)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 843, in _invoke_run
time.sleep(monitor_interval)
File "/home/user2/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 60, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 4156314 got signal: 1

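This looks like a nohup bug, but more precisely nohup only starts the command with SIGHUP ignored, while the traceback shows the launcher's own _terminate_process_handler catching signal 1 (SIGHUP) and shutting the workers down. Either way, we can use tmux in place of nohup: the tmux server keeps the session's terminal alive after the SSH connection closes, so the training process is not sent SIGHUP.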
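
To make the "got signal: 1" line concrete: signal 1 is SIGHUP, which processes attached to a terminal receive when that terminal goes away. The sketch below is not from the original post. It only illustrates what nohup normally provides: a child started with SIGHUP ignored survives the signal, while a child with the default disposition is terminated. The `sleep 5` commands are placeholder workloads, and the sketch assumes the shell running it does not itself have SIGHUP ignored.

```shell
#!/bin/sh
# Signal number 1 is SIGHUP, the signal named in the traceback.
sig_name=$(kill -l 1)
echo "signal 1 is SIG${sig_name}"

# Child A: SIGHUP ignored before exec, which is what nohup arranges.
( trap '' HUP; exec sleep 5 ) &
ignored_pid=$!

# Child B: default SIGHUP disposition (terminate).
sleep 5 &
default_pid=$!

sleep 1   # give both children time to start
kill -HUP "$ignored_pid" "$default_pid"

# wait reports 128 + signal number when a child was killed by a signal.
default_status=0
wait "$default_pid" || default_status=$?

# Child A ignored the signal, so it should still be running.
if kill -0 "$ignored_pid" 2>/dev/null; then
    ignored_alive=yes
else
    ignored_alive=no
fi

echo "default child exit status: $default_status"
echo "ignoring child still alive: $ignored_alive"

# Clean up the surviving child.
kill -TERM "$ignored_pid" 2>/dev/null || true
wait "$ignored_pid" 2>/dev/null || true
```

A program that installs its own SIGHUP handler after startup replaces the ignore setting, which is consistent with the launcher in the traceback exiting even under nohup.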

Solution:

Go straight to Ruan Yifeng's tutorial, which is detailed and clear and takes only a few minutes to learn from: Tmux 使用教程 - 阮一峰的网络日志 (ruanyifeng.com)

Here is a short summary of the tmux commands. For simple background training, the following few are enough:

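sudo apt-get install tmux                 # install (Debian/Ubuntu)
tmux                                      # start tmux and enter a new session
exit                                      # exit the tmux session, or press [ Ctrl+d ]
tmux new -s <session-name>                # create a session with the given name
# [ Ctrl+b ] is the tmux prefix key: press it first, then press another key to run the corresponding command
[ Ctrl+b ] [ d ]                          # detach the session from the terminal, or run: tmux detach
tmux ls                                   # list all sessions, same as: tmux list-sessions
tmux attach -t <session-name>             # attach the terminal to the named session
tmux kill-session -t <session-name>       # kill the named session
tmux switch -t <session-name>             # switch to the named session (run from inside tmux)
tmux rename-session -t 0 <new-name>       # rename session 0 to <new-name>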
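
A common pattern built from the commands above is "attach if the session exists, otherwise create it". The helper below is a minimal sketch rather than part of the original post. The function name `attach_or_create` is hypothetical, and it relies on `tmux has-session`, which exits with status 0 only when the target session exists.

```shell
#!/bin/sh
# attach_or_create NAME
# Attach to the tmux session NAME if it exists, otherwise create it.
# The function name and its argument are illustrative, not tmux built-ins.
attach_or_create() {
    name=$1
    if [ -z "$name" ]; then
        echo "usage: attach_or_create <session-name>" >&2
        return 2
    fi
    # has-session succeeds only when a matching session exists.
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach -t "$name"
    else
        tmux new -s "$name"
    fi
}
```

Usage: source the file, then run `attach_or_create train_model` after every SSH login to get back to the same session.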

Basic tmux workflow:

[terminal]: tmux new -s train_model       # create a session named train_model
[tmux]: conda activate env_name           # inside the tmux session, activate the conda environment to use
[tmux]: python train.py                   # inside the tmux session, start training the model
[tmux]: [ Ctrl+b ] [ d ]                  # detach the session from the terminal
[terminal]: tmux ls                       # list sessions to confirm the one just created
[terminal]: watch -n 1 -c gpustat --color # monitor GPU status
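
The same workflow can be scripted so that no interactive attach and detach is needed. The sketch below is an assumption-laden variant, not the author's method. The function name `start_training` and the `<session>.log` log file are hypothetical; `tmux new-session -d` creates the session already detached, and `tmux send-keys` types each command into the session's shell.

```shell
#!/bin/sh
# start_training SESSION ENV SCRIPT
# Run SCRIPT inside a detached tmux session. All three arguments are
# caller-supplied; the log file name is an illustrative choice.
start_training() {
    session=$1                 # e.g. train_model
    env_name=$2                # e.g. env_name (a conda environment)
    script=$3                  # e.g. train.py
    log_file="${session}.log"  # hypothetical log location

    # -d creates the session detached, so this terminal never attaches to it.
    tmux new-session -d -s "$session" || return 1

    # send-keys types into the session's shell; Enter submits each line.
    tmux send-keys -t "$session" "conda activate $env_name" Enter
    tmux send-keys -t "$session" "python $script 2>&1 | tee $log_file" Enter
}
```

Usage: `start_training train_model env_name train.py`, then `tmux attach -t train_model` to watch the run or `tail -f train_model.log` to follow the log without attaching.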

Posted 2022-10-01 09:23 by gy77

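All content on this blog is intended for study, research, and sharing. If you repost it, please credit the author and the source, and use it for non-commercial purposes only. Thank you.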