Deadlock at dist.barrier() in PyTorch distributed training
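1. Install the NCCL library.
2. Set the environment variables (add the export lines to /etc/profile, then reload it). NCCL_P2P_DISABLE=1 turns off direct GPU-to-GPU peer-to-peer transport, and NCCL_IB_DISABLE=1 turns off the InfiniBand transport: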
vim /etc/profile
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
source /etc/profile
https://gitee.com/573363031/baidu_ai_security_advbox/blob/master/paddle.md
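
The same two settings can also be applied per process from the training script instead of system-wide. The following is a minimal sketch, not a confirmed fix: the helper names apply_nccl_workaround and init_distributed are hypothetical, and it assumes the variables are set before the NCCL process group is created and that the job is started by a launcher that provides RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.

```python
import os

# The two settings from step 2 above.
NCCL_WORKAROUND = {
    "NCCL_P2P_DISABLE": "1",  # turn off direct GPU-to-GPU (peer-to-peer) transport
    "NCCL_IB_DISABLE": "1",   # turn off the InfiniBand transport
}


def apply_nccl_workaround(env=os.environ):
    """Set the workaround variables in `env` and return the ones that changed."""
    changed = {}
    for name, value in NCCL_WORKAROUND.items():
        if env.get(name) != value:
            env[name] = value
            changed[name] = value
    return changed


def init_distributed():
    """Hypothetical entry point: apply the settings, then join the process group."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    # Assumption: NCCL reads these variables when the communicator is created,
    # so they must be in place before init_process_group is called.
    apply_nccl_workaround()

    # The default env:// init method reads RANK, WORLD_SIZE, MASTER_ADDR and
    # MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")

    # The call that was hanging; every rank has to reach it.
    dist.barrier()
```

Setting the variables in the script keeps the change local to one job, whereas editing /etc/profile affects every login shell on the machine. Disabling both transports can reduce communication throughput, so it is worth checking whether only one of the two is needed.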