RuntimeError: CUDA error: device-side assert triggered

While debugging a diffusion model, I hit an error at the loss computation. The reported location was:

`acc_train_loss += loss.item()`

RuntimeError: CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
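Since the message itself suggests `CUDA_LAUNCH_BLOCKING=1`, here is a minimal sketch of one way to turn it on from inside the script. It assumes the variable is set before anything initializes CUDA, so it goes at the very top of the entry script. I did not use this in the original debugging session.

```python
import os

# Assumption: this runs before CUDA is initialized (i.e. before any tensor
# is moved to the GPU), so kernel launches become synchronous and the
# stack trace points at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print(launch_blocking)
```

Setting the variable in the shell when launching the training script should work the same way. Synchronous launches slow training down, so this is for debugging only.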

I set a breakpoint at the loss function in the training loop and stepped through it. The debugger showed the message `unable to get repr for <class 'torch.Tensor'>`.

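Following https://discuss.pytorch.org/t/unable-to-get-repr-for-class-torch-tensor/115627/3, I ran the code on the CPU instead, which gave a readable error:
`out = a.gather(-1, t)` RuntimeError: index -1 is out of bounds for dimension 0 with size 10
So it really is an out-of-bounds index: when fetching the sample for step t-1, t=0 produces the index -1, which is out of range.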
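The failure can be reproduced on the CPU with a few lines. This is only a sketch: the schedule `a` with 10 entries and the timestep values are made up for illustration, and only the `a.gather(-1, t)` call is taken from my traceback.

```python
import torch

# Hypothetical schedule with 10 entries (matches "size 10" in the error).
a = torch.linspace(1e-4, 0.02, 10)

# Hypothetical batch of timesteps; one of them is t = 0.
t = torch.tensor([3, 0])
t_prev = t - 1  # tensor([2, -1]) -> -1 is not a valid gather index

try:
    out = a.gather(-1, t_prev)
    error_message = None
except (RuntimeError, IndexError) as exc:
    # On the CPU the bounds check raises immediately with a clear message.
    error_message = str(exc)

print(error_message)
```

Unlike ordinary Python indexing, `gather` does not treat -1 as "last element", so the negative index is rejected.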

Fix: add `t = t.clamp_min(0)` to remove the negative values.
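A minimal sketch of the fix applied to a small, made-up example (the schedule `a` and the timesteps are hypothetical, not from my model):

```python
import torch

# Hypothetical schedule with 10 entries and a batch containing t = 0.
a = torch.linspace(1e-4, 0.02, 10)
t = torch.tensor([3, 0])

# Clamp the shifted timestep so that t - 1 never drops below 0.
t_prev = (t - 1).clamp_min(0)

# All indices are now inside [0, 9], so gather succeeds.
out = a.gather(-1, t_prev)
print(t_prev, out)
```

Note that after clamping, the t=0 entry reuses index 0 for its "t-1" lookup. That removes the crash, but whether it is the right value for the last sampling step depends on the sampler, so it may be worth handling t=0 explicitly.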

posted @ 2023-02-09 14:38  不要肥宅