Problem log

Converting torch to ONNX

While converting a fairly simple model, an error occurred, even though the model contains no complicated operators.

RuntimeError: tuple appears in op that does not forward tuples (VisitNode at /pytorch/torch/csrc/jit/passes/lower_tuples.cpp:109)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f27f31b3fe1 in /home/train/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f27f31b3dfa in /home/train/.local/lib/python3.5/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x6da2e1 (0x7f27def7f2e1 in /home/train/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)
frame #3: <unknown function> + 0x6da534 (0x7f27def7f534 in /home/train/.local/lib/python3.5/site-packages/torch/lib/libtorch.so.1)

Searching around turned up reports of a similar error.
Removing the DataParallel wrapper fixes it.

model = mobilenetv2()
model = torch.nn.DataParallel(model).cuda()  # the DataParallel wrapper is what breaks the export

Change to

model = mobilenetv2().cuda()
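Alternatively, if the checkpoint was trained under DataParallel, the underlying module can be exported directly via model.module. A rough sketch (the dummy input shape and output file name here are placeholders, not from the original post):

import torch

model = mobilenetv2()
model = torch.nn.DataParallel(model).cuda()
# ... load DataParallel-trained weights here ...

# export the underlying module, not the DataParallel container
dummy_input = torch.randn(1, 3, 224, 224).cuda()  # placeholder shape
torch.onnx.export(model.module, dummy_input, "mobilenetv2.onnx")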

matplotlib plotting issue

https://stackoverflow.com/questions/33676608/pandas-type-error-trying-to-plot

Plotting with plt.scatter(df.Time, y=df.Value, marker='o') fails when df.Time is a pandas datetime column.

fig = plt.figure(figsize=(x_size,y_size))
ax = fig.add_subplot(111)
ax.scatter(df.Time, y=df.Value, marker='o')

Change to

fig = plt.figure(figsize=(x_size,y_size))
ax = fig.add_subplot(111)
ax.plot_date(x=df.Time, y=df.Value, marker='o')

Or

fig = plt.figure(figsize=(x_size,y_size))
ax = fig.add_subplot(111)
ax.scatter(list(df.Time.values), list(df.Value.values), marker='o')
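Yet another option (a sketch I have not verified against this exact DataFrame) is to convert the datetimes to matplotlib's numeric date format explicitly with matplotlib.dates.date2num:

import matplotlib.dates as mdates

fig = plt.figure(figsize=(x_size, y_size))
ax = fig.add_subplot(111)
ax.scatter(mdates.date2num(df.Time.values), df.Value, marker='o')
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))  # readable tick labels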

At least one stride in the given numpy array is negative, and tensors with negative strides are not currently supported

ValueError: At least one stride in the given numpy array is negative, and tensors with negative strides are not currently supported. (You can probably work around this by making a copy of your array  with array.copy().) 

Converting an ndarray to a tensor requires a memory layout without negative strides; views produced by reversed slicing or np.flip must be copied first.
https://www.cnblogs.com/devilmaycry812839668/p/13761613.html
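A minimal standalone reproduction and the copy-based fix:

import numpy as np
import torch

a = np.arange(6, dtype=np.float32)
b = a[::-1]                     # reversed view: its stride is negative
# torch.from_numpy(b)           # raises the ValueError above

t = torch.from_numpy(b.copy())                  # copy() yields positive strides
t = torch.from_numpy(np.ascontiguousarray(b))   # equivalent fix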

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

The error is raised at the optimizer.step() call.
Cause: the parameters in the checkpoint were saved from GPU memory, while the freshly built model still lives on the CPU. The model has to be moved to the GPU before the checkpoint is loaded.
That is:

        checkpoint = torch.load('checkpoints/latest2.pt')
        yolov3net.load_state_dict(checkpoint['model'])
        optimizer = torch.optim.Adam(yolov3net.parameters())
        optimizer.load_state_dict(checkpoint['optimizer'])

Change to

        yolov3net = yolov3net.cuda()  # move the model to the GPU before loading the checkpoint
        checkpoint = torch.load('checkpoints/latest2.pt')
        yolov3net.load_state_dict(checkpoint['model'])
        optimizer = torch.optim.Adam(yolov3net.parameters())
        optimizer.load_state_dict(checkpoint['optimizer'])
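If reordering the setup is impractical, the tensors held in the loaded optimizer state can also be pushed to the GPU by hand. A hedged sketch of this common workaround (not from the original post):

        # move every tensor in the optimizer state onto the GPU
        for state in optimizer.state.values():
            for k, v in state.items():
                if torch.is_tensor(v):
                    state[k] = v.cuda()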

model.train() vs model.eval()

When training on a very small dataset, the model performed well during training, so I saved it. But at inference time with model.eval(), its behavior differed drastically.
Strangely, even after I made the training and non-training branches in my custom layer's forward method exactly identical, the model.eval() predictions still differed greatly from the training-time results.
The cause is that a BatchNormalization layer's forward behavior differs between training and inference: in train mode it normalizes with the current batch's mean and variance, while in eval mode it uses the running mean and variance accumulated during training.
With a very small dataset the model overfits easily and the per-batch statistics can deviate a lot from the running estimates, hence the large gap between train mode and eval mode.
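A small standalone sketch that makes the BatchNorm difference visible:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 5 + 3                # data far from N(0, 1)

bn.train()
out_train = bn(x)                            # normalized with this batch's statistics

bn.eval()
out_eval = bn(x)                             # normalized with running_mean / running_var
print(torch.allclose(out_train, out_eval))   # typically False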
