Pitfalls encountered while reconfiguring a semantic segmentation experiment environment

The new framework is semseg, by hszhao: https://github.com/hszhao/semseg

1. Installing apex fails with: fatal error: gnu-crypt.h: No such file or directory

The root cause is that the cryptacular package obtained through pip is broken; running conda install cryptacular fixes it.

 

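2. pip install always installs into a different virtual environment

This happens because the pip being invoked does not belong to the currently active virtual environment. conda install targets the active environment by default, but pip does not necessarily do so.

So run whereis pip to locate the pip executable of the environment you actually want, then invoke pip install through its absolute path.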
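
An alternative to hunting for the right pip binary is to run pip as a module of the interpreter you are actually using. The sketch below is only an illustration and not part of the original setup: the helper name pip_install_command is hypothetical, while sys.executable, shutil.which, and "python -m pip" are standard.

```python
import shutil
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this script."""
    # "python -m pip" runs the pip installed for that exact interpreter,
    # regardless of which "pip" executable comes first on PATH.
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    print("interpreter :", sys.executable)
    print("pip on PATH :", shutil.which("pip"))
    print("safe command:", " ".join(pip_install_command("pyyaml")))
```

If the interpreter path and the pip found on PATH point into different environments, that mismatch is the situation described above.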

 

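3. About pip and conda mirrors

As of today, June 10, 2019, the Tsinghua mirror for conda has been shut down because of copyright issues, while the Tsinghua mirror for pip still works normally.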

 

4. ModuleNotFoundError: No module named 'yaml'

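The yaml module is provided by PyYAML. The following fixed it (pyaml pulls in PyYAML as a dependency; conda install pyyaml installs it directly):

conda install pyaml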

 

 

5. TypeError: Class advice impossible in Python3.  Use the @implementer class decorator instead

First, switch the active CUDA version so that it matches the CUDA version PyTorch was built with.

Then uninstall any previously installed apex.

Then:

git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
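
Before building apex, it can help to confirm that the toolkit on PATH really matches PyTorch. The following is a minimal sketch under the assumption that nvcc is on PATH and prints its usual "release X.Y" line; the function names are made up for illustration.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def cuda_versions_match(torch_cuda, nvcc_output):
    """True when the toolkit release equals the major.minor of PyTorch's CUDA."""
    release = parse_nvcc_release(nvcc_output)
    if release is None or torch_cuda is None:
        return False
    torch_major_minor = ".".join(torch_cuda.split(".")[:2])
    return release == torch_major_minor


if __name__ == "__main__":
    import subprocess
    import torch  # only needed for the live check

    out = subprocess.run(["nvcc", "--version"], stdout=subprocess.PIPE,
                         universal_newlines=True).stdout
    print("torch CUDA:", torch.version.cuda)
    print("match:", cuda_versions_match(torch.version.cuda, out))
```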

 

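With multiple virtual environments, things can get mixed up. In that case, use

whereis pip

to find the pip executable of your current virtual environment, then run pip through its absolute path.

The same goes for the python used in the final apex installation step above: it can also be invoked by its absolute path, which guarantees that the install lands in the right place.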

 

Do not install apex with the following command:

git clone https://github.com/NVIDIA/apex.git && cd apex && python setup.py install --cuda_ext --cpp_ext

The reason is probably that I was inside a conda virtual environment.

Reference: https://github.com/NVIDIA/apex/issues/214#issuecomment-476399539

 

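6. No error message at all, not even under PDB; the process simply vanishes at x = self.layer0(x)

A smaller batch size and a smaller input image size fixed it, so the process was presumably running out of memory.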

 

7. Warning:  apex was installed without --cpp_ext.  Falling back to Python flatten and unflatten.

If this warning persists even though apex was installed correctly, the problem runs deeper.

According to: https://discuss.pytorch.org/t/undefined-symbol-when-import-lltm-cpp-extension/32627/2

there are two possible fixes:

  1. build cpp extensions with -D_GLIBCXX_USE_CXX11_ABI=1.
  2. build pytorch with -D_GLIBCXX_USE_CXX11_ABI=0

However, I did not know how to pass extra compiler flags to the apex build. I tried the export approach they suggest, namely:

export CFLAGS="-D_GLIBCXX_USE_CXX11_ABI=1 $CFLAGS"

and then rebuilt apex, but during compilation the flag was still 0, so it had no effect.

In the end, based on this remark:

The best way to solve this problem in any case is to compile Pytorch from source and use that same compiler for the extension. Then all problems go away.

 

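I decided to simply build both PyTorch and apex from source on this machine.

It then turned out that building PyTorch from source is quite hard... I ran into a pile of bugs with no solution in sight, and finally decided to just reinstall the PyTorch binary.

This time I installed PyTorch through a different route than before.

Previously I had used:

conda install pytorch torchvision cudatoolkit=10.0

To speed up the download I had avoided the official PyTorch channel and pulled from the default conda channel instead.

However, in Python, evaluating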

torch._C._GLIBCXX_USE_CXX11_ABI

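returned True, which means PyTorch had been built with

-D_GLIBCXX_USE_CXX11_ABI=1

and that does not match the value 0 used when compiling apex.

This time I used the official PyTorch channel:

conda install pytorch torchvision cudatoolkit=10.0 -c pytorch

and, magically, this time the flag was

-D_GLIBCXX_USE_CXX11_ABI=0

The official PyTorch channel really is the dependable one... The default conda channel only repackages the official builds and is not updated promptly.

After that, everything worked...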
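
The ABI check used above can be wrapped in a small helper. This is a sketch only: abi_flag and torch_abi_flag are hypothetical names, and torch._C._GLIBCXX_USE_CXX11_ABI is the same private attribute queried earlier, so it may change between PyTorch versions.

```python
def abi_flag(uses_cxx11_abi):
    """Return the -D flag matching how PyTorch itself was compiled."""
    return "-D_GLIBCXX_USE_CXX11_ABI={}".format(1 if uses_cxx11_abi else 0)


def torch_abi_flag():
    """Query the installed PyTorch; returns None when PyTorch is not importable."""
    try:
        import torch
    except ImportError:
        return None
    return abi_flag(bool(torch._C._GLIBCXX_USE_CXX11_ABI))


if __name__ == "__main__":
    # An extension such as apex has to be compiled with this same value.
    print(torch_abi_flag())
```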

 

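To sum up:

1. The official channel is the most reliable. Don't cut corners, and don't keep trying to compile things yourself; that only causes more problems.

2. When you hit a problem, search the issues of the corresponding GitHub repository; this matters a lot. In particular, search in English. Results found in Chinese are mostly second-hand information.

3. Google's search really is powerful; use Google whenever you can!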

 

 

8. ValueError: batch_size should be a positive integer value, but got batch_size=0 

In the config file, the entry

batch_size_val: 8  # batch size for validation during training, memory and speed tradeoff

needs to be set to the same number as the GPUs, although I don't know why.
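
A plausible explanation (an assumption, not confirmed in these notes) is that in distributed mode the training script divides batch_size_val by the number of GPUs per node and truncates the result, so any value smaller than the GPU count becomes 0. The hypothetical helper below only illustrates that arithmetic:

```python
def per_process_batch_size(total_batch_size, ngpus_per_node):
    """Split a global batch size across GPUs, truncating like int(a / b)."""
    return int(total_batch_size / ngpus_per_node)


if __name__ == "__main__":
    for total in (4, 8, 16):
        print(total, "->", per_process_batch_size(total, 8))
```

Under that assumption, any value that is at least the GPU count would avoid the error, and a multiple of the GPU count keeps every process on the same batch size.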

 

9. cv2.error: OpenCV(4.1.0) /io/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

data_root was wrong, so no data was read.
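
cv2.imread returns None instead of raising when a file cannot be read, which is why the failure only surfaces later in cvtColor. A quick pre-flight check can catch a wrong data_root earlier. The sketch below assumes a list file whose lines start with an image path relative to data_root; missing_images is a hypothetical helper.

```python
import os


def missing_images(data_root, list_lines):
    """Return the image paths from a list file that do not exist under data_root."""
    missing = []
    for line in list_lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        image_path = os.path.join(data_root, parts[0])
        if not os.path.isfile(image_path):
            missing.append(image_path)
    return missing
```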

 

 

10. Exception: process 0 terminated with signal SIGSEGV

Out of memory.

No real fix so far.

The only workaround was to turn off distributed training; it runs for now, but slowly.

 

11. Slow pip downloads

On Linux, edit ~/.pip/pip.conf (create it if it does not exist) and point index-url to TUNA, as follows:

 [global]
 index-url = https://pypi.tuna.tsinghua.edu.cn/simple
 

12. Check which CUDA version PyTorch was built against

print(torch.version.cuda)

 

13. libSM.so.6: cannot open shared object file: No such file or directory

Following https://stackoverflow.com/questions/47113029/importerror-libsm-so-6-cannot-open-shared-object-file-no-such-file-or-directo, install the headless OpenCV packages:

pip install opencv-python-headless
# also contrib, if needed
pip install opencv-contrib-python-headless

 

 

14. fatal error: gnu-crypt.h: No such file or directory

This occurred while installing apex. Run

conda install cryptacular

and then install apex.

 

15.  OMP: Error #13: Assertion failure at z_Linux_util.cpp(2361).

 

OMP: Hint Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Traceback (most recent call last):
  File "tool/train.py", line 456, in <module>
    main()
  File "tool/train.py", line 106, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xxx/.conda/envs/pytorchseg10/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 2 terminated with signal SIGABRT

I hit this problem while trying to set up the environment again on a new server.

Fix:

Add this line to train.sh

export KMP_INIT_AT_FORK=FALSE

train.sh now looks like this:

#!/bin/sh
PARTITION=gpu
PYTHON=python

dataset=$1
exp_name=$2
exp_dir=exp/${dataset}/${exp_name}
model_dir=${exp_dir}/model
result_dir=${exp_dir}/result
config=config/${dataset}/${dataset}_${exp_name}.yaml
now=$(date +"%Y%m%d_%H%M%S")

mkdir -p ${model_dir} ${result_dir}
cp tool/train.sh tool/train.py ${config} ${exp_dir}

export PYTHONPATH=./
export KMP_INIT_AT_FORK=FALSE
#sbatch -p $PARTITION --gres=gpu:8 -c16 --job-name=train \
$PYTHON -u tool/train.py \
  --config=${config} \
  2>&1 | tee ${model_dir}/train-$now.log

 

Reference: https://github.com/ContinuumIO/anaconda-issues/issues/11294
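
When the training script is started without train.sh, the same variable can be set from Python. This is a hedged sketch: it assumes the variable only needs to be in the process environment before the OpenMP runtime is loaded (that is, before numpy or torch is imported), and disable_kmp_init_at_fork is a hypothetical helper name.

```python
import os


def disable_kmp_init_at_fork(environ=os.environ):
    """Set KMP_INIT_AT_FORK=FALSE unless the caller already chose a value."""
    environ.setdefault("KMP_INIT_AT_FORK", "FALSE")
    return environ["KMP_INIT_AT_FORK"]


# Call this at the very top of the entry script, before importing numpy/torch.
disable_kmp_init_at_fork()
```

Child processes started afterwards inherit the environment, so workers launched by mp.spawn see the same value.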

 

 

 

16.  RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1556653114079/work/torch/lib/c10d/ProcessGroupNCCL.cpp:272, unhandled cuda error

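Many sources (for example https://github.com/pytorch/pytorch/issues/23534)

say this is an NCCL version problem, so I printed the version (on the command line, before launching python):

export NCCL_DEBUG=VERSION

Then I ran the program and saw that my version is 2.4.8.

At this point the error in the title no longer appeared...

So this error is only a symptom that hides the real one.

But after running for a while the job aborted on its own, and rerunning produced the same error again... very strange.

To get debug output, I added

export NCCL_DEBUG=info

and found this error:

Cuda failure 'an illegal memory access was encountered'

There is no general fix for this problem; the cause differs from case to case.

I then noticed that the error occurred every time the process for GPU3 started initializing. After excluding GPU3 the error went away... could the hardware be faulty?

After further checks I confirmed that only GPU3 is affected. The error differs from run to run, so I record the variants here.

Error log 1: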

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 93, in forward
    x = self.layer4(x_tmp)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Error log 2:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 92, in forward
    x_tmp = self.layer3(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 82, in forward
    out = self.conv2(out)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cublas runtime error : resource allocation failed at /opt/conda/conda-bld/pytorch_1568696969690/work/aten/src/THC/THCGeneral.cpp:216

 

For error log 1,

according to https://github.com/qqwweee/keras-yolo3/issues/332#issuecomment-517989338

a patch should be installed: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

Installing it made no difference.

For error log 2, following https://github.com/huggingface/transfer-learning-conv-ai/issues/10#issuecomment-496111466

I added export CUDA_LAUNCH_BLOCKING=1

The result is error log 3:

Traceback (most recent call last):
  File "tool/train.py", line 480, in <module>
    main()
  File "tool/train.py", line 115, in main
    main_worker(args.train_gpu, args.ngpus_per_node, args)
  File "tool/train.py", line 290, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "tool/train.py", line 335, in train
    output, main_loss, aux_loss = model(input, target)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/pspnet.py", line 90, in forward
    x = self.layer1(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/models/resnet.py", line 90, in forward
    residual = self.downsample(x)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/module.py", line 545, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 345, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 342, in conv2d_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

Adding the following line made it work for a while:

torch.backends.cudnn.benchmark = True

It ran fine for ten minutes and then broke again, with the same error as log 3...

 

 ==================================================================================================================

 

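I was not sure whether the third card was still to blame, so this time I trained on all the other cards for a longer period to see.

Training on all the other cards together ran for a long time without any problem.

Two takeaways:

1. Using only the third card causes problems

2. With a single card, training is the plain, ordinary kind; it does not trigger any of the multi-threading, multi-process, or distributed code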
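
To narrow down which card misbehaves, each GPU can be exercised in isolation with a small workload. The sketch below is an illustration under stated assumptions, not the procedure used here: find_failing_devices and torch_matmul_probe are hypothetical names, and a plain matrix multiplication may not reproduce every failure mode seen in the logs above.

```python
def torch_matmul_probe(device_id):
    """Run a small matmul on one GPU; raises if the CUDA call fails."""
    import torch  # imported lazily so the helper below works without PyTorch

    device = torch.device("cuda:{}".format(device_id))
    a = torch.randn(1024, 1024, device=device)
    b = torch.randn(1024, 1024, device=device)
    (a @ b).sum().item()  # .item() forces a sync so errors surface here


def find_failing_devices(device_ids, probe, repeats=3):
    """Return (device_id, error message) for every device whose probe raised."""
    failures = []
    for dev in device_ids:
        for _ in range(repeats):
            try:
                probe(dev)
            except Exception as exc:
                failures.append((dev, "{}: {}".format(type(exc).__name__, exc)))
                break  # one failure is enough to flag this device
    return failures


if __name__ == "__main__":
    import torch

    print(find_failing_devices(range(torch.cuda.device_count()), torch_matmul_probe))
```

After an error such as an illegal memory access, the CUDA context of that process is generally unusable, so running the probe for each GPU in a separate process gives more trustworthy results.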

 

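I will try once more to see whether this can be solved at the software level; if not, I can only blame a broken graphics card, or a problem with the server.

Some people say it is a cuDNN version problem, so I first disabled cuDNN:

torch.backends.cudnn.enabled = False 

With cuDNN disabled, training on the third card alone did run.

Then I tested all cards with cuDNN disabled. After 20 iterations it failed with: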

[2019-09-29 20:03:08,046 INFO train.py line 404 98898] Epoch: [44/200][20/186] Data 0.001 (0.106) Batch 1.274 (1.504) Remain 12:11:15 MainLoss 0.1086 AuxLoss 0.1171 Loss 0.1555 Accuracy 0.9622.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1568696969690/work/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7ffa46ff5477 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x17044 (0x7ffa47231044 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1cccb (0x7ffa47236ccb in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7ffa46fe2e8d in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x1c2789 (0x7ffa7892f789 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x445d2b (0x7ffa78bb2d2b in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x445d61 (0x7ffa78bb2d61 in /home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x1a184f (0x558f4667b84f in /home/lzx/.conda/envs/seg/bin/python)
frame #8: <unknown function> + 0xfd1a8 (0x558f465d71a8 in /home/lzx/.conda/envs/seg/bin/python)
frame #9: <unknown function> + 0x10e3c7 (0x558f465e83c7 in /home/lzx/.conda/envs/seg/bin/python)
frame #10: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python)
frame #11: <unknown function> + 0x10e3dd (0x558f465e83dd in /home/lzx/.conda/envs/seg/bin/python)
frame #12: <unknown function> + 0xf5777 (0x558f465cf777 in /home/lzx/.conda/envs/seg/bin/python)
frame #13: <unknown function> + 0xf57e3 (0x558f465cf7e3 in /home/lzx/.conda/envs/seg/bin/python)
frame #14: <unknown function> + 0xf5766 (0x558f465cf766 in /home/lzx/.conda/envs/seg/bin/python)
frame #15: <unknown function> + 0x1db5e3 (0x558f466b55e3 in /home/lzx/.conda/envs/seg/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x2a5a (0x558f466a7e4a in /home/lzx/.conda/envs/seg/bin/python)
frame #17: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x6a3 (0x558f466a5a93 in /home/lzx/.conda/envs/seg/bin/python)
frame #19: _PyFunction_FastCallKeywords + 0xfb (0x558f4663dccb in /home/lzx/.conda/envs/seg/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x416 (0x558f466a5806 in /home/lzx/.conda/envs/seg/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python)
frame #22: _PyFunction_FastCallKeywords + 0x387 (0x558f4663df57 in /home/lzx/.conda/envs/seg/bin/python)
frame #23: _PyEval_EvalFrameDefault + 0x14dc (0x558f466a68cc in /home/lzx/.conda/envs/seg/bin/python)
frame #24: _PyEval_EvalCodeWithName + 0x2f9 (0x558f465ee539 in /home/lzx/.conda/envs/seg/bin/python)
frame #25: PyEval_EvalCodeEx + 0x44 (0x558f465ef424 in /home/lzx/.conda/envs/seg/bin/python)
frame #26: PyEval_EvalCode + 0x1c (0x558f465ef44c in /home/lzx/.conda/envs/seg/bin/python)
frame #27: <unknown function> + 0x22ab74 (0x558f46704b74 in /home/lzx/.conda/envs/seg/bin/python)
frame #28: PyRun_StringFlags + 0x7d (0x558f4670fddd in /home/lzx/.conda/envs/seg/bin/python)
frame #29: PyRun_SimpleStringFlags + 0x3f (0x558f4670fe3f in /home/lzx/.conda/envs/seg/bin/python)
frame #30: <unknown function> + 0x235f3d (0x558f4670ff3d in /home/lzx/.conda/envs/seg/bin/python)
frame #31: _Py_UnixMain + 0x3c (0x558f467102bc in /home/lzx/.conda/envs/seg/bin/python)
frame #32: __libc_start_main + 0xf0 (0x7ffa91705830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #33: <unknown function> + 0x1db062 (0x558f466b5062 in /home/lzx/.conda/envs/seg/bin/python)

/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
/home/lzx/.conda/envs/seg/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "tool/train.py", line 488, in <module>
    main()
  File "tool/train.py", line 114, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args.ngpus_per_node, args))
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 298, in main_worker
    loss_train, mIoU_train, mAcc_train, allAcc_train = train(train_loader, model, optimizer, epoch)
  File "/home/lzx/segmentation_exp_LINUX/L012_cell/tool/train.py", line 351, in train
    scaled_loss.backward()
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/tensor.py", line 120, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/lzx/.conda/envs/seg/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

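But why does it still end with a cuDNN error? Wasn't cuDNN disabled?

According to: https://blog.csdn.net/qq_39938666/article/details/86611474

the Python version might be the problem, so I tried the same version used there, 3.6.6.

I recreated the environment with Python=3.6.6, and it still failed...

Everything works on the other cards and only GPU03 fails, which means the code is fine and the card itself is the problem, or there is some incompatibility.

I tried swapping the cards around, but the error kept pointing at GPU02. (It used to be GPU03; later, for some reason, it was always GPU02. Could the power connector be faulty?)

The current error (single card, GPU02) is cuda runtime error (77) : an illegal memory access was encountered

Following some answers, e.g. https://ethereum.stackexchange.com/questions/65652/error-cuda-mining-an-illegal-memory-access-was-encountered

I am trying to upgrade CUDA to 10.1; it is currently 10.0

That did not help either... I will just use 7 cards for now.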
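
One way to train on the remaining cards is to hide the faulty one through CUDA_VISIBLE_DEVICES. This is a minimal sketch assuming 8 GPUs in the machine and that index 2 is the bad one; visible_devices is a hypothetical helper.

```python
import os


def visible_devices(total_gpus, exclude):
    """Build a CUDA_VISIBLE_DEVICES value that skips the given GPU indices."""
    return ",".join(str(i) for i in range(total_gpus) if i not in exclude)


if __name__ == "__main__":
    # Must be set before CUDA is initialized (i.e. before importing torch here).
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(8, exclude={2})
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process the visible devices are renumbered from 0, so any GPU list in the training config has to refer to the new indices.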

 

 

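Update, October 3

Today I more or less solved this problem, largely by accident.

Mainly thanks to: https://github.com/pytorch/pytorch/issues/22050#issuecomment-521030783

I recognize this person's avatar: an NVIDIA employee who frequently answers questions on the PyTorch forum.

He said that besides conda, one should also try installing with pip.

So I used pip to install PyTorch in a different environment. After that, neither the new environment nor the old one showed the problem anymore.

The underlying reason is probably that the cuDNN in use was replaced by the one that came with the pip install. Everyone says this is a cuDNN problem; perhaps the version bundled with this pip package just happens to work.

The command he gave was

pip3 install torch torchvision

However, as everyone knows, the packages on a pip index get updated over time, so here is what I actually downloaded: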

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting torch
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/30/57/d5cceb0799c06733eefce80c395459f28970ebb9e896846ce96ab579a3f1/torch-1.2.0-cp36-cp36m-manylinux1_x86_64.whl (748.8MB)
     |████████████████████████████████| 748.9MB 68kB/s 
Collecting torchvision
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/06/e6/a564eba563f7ff53aa7318ff6aaa5bd8385cbda39ed55ba471e95af27d19/torchvision-0.4.0-cp36-cp36m-manylinux1_x86_64.whl (8.8MB)
     |████████████████████████████████| 8.8MB 1.0MB/s 
Requirement already satisfied: numpy in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torch) (1.17.2)
Requirement already satisfied: pillow>=4.1.1 in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (6.1.0)
Requirement already satisfied: six in /home/lzx/.conda/envs/pytorchseg10/lib/python3.6/site-packages (from torchvision) (1.12.0)
Installing collected packages: torch, torchvision
Successfully installed torch-1.2.0 torchvision-0.4.0

 

For reference only.
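
Since the working combination depends on the exact builds, it may be worth recording the versions from inside Python as well. The helper below is a hypothetical sketch; it only reads attributes that PyTorch exposes (torch.__version__, torch.version.cuda, torch.backends.cudnn.version()).

```python
def collect_versions():
    """Return installed torch/torchvision/CUDA/cuDNN versions (None if missing)."""
    versions = {"torch": None, "torchvision": None, "cuda": None, "cudnn": None}
    try:
        import torch
        versions["torch"] = torch.__version__
        versions["cuda"] = torch.version.cuda
        versions["cudnn"] = torch.backends.cudnn.version()
    except ImportError:
        pass  # PyTorch not installed in this environment
    try:
        import torchvision
        versions["torchvision"] = torchvision.__version__
    except ImportError:
        pass  # torchvision not installed in this environment
    return versions


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(name, value)
```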

 

posted on 2019-06-10 20:10 by Oliver-cs
