Notes on problems deploying libtorch in a C++ Qt application

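I have recently been working out how to deploy a PyTorch object-detection model in C++, running the forward pass there to get predictions, so that it can be integrated into a Qt application I wrote earlier.
I am new to deep learning and PyTorch and ran into many problems along the way, so I am recording them here for future reference.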

Environment:

Windows 10
CUDA 11.3
Python 3.8
VS2019

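Saving the Python object-detection model as a .pt file
Model source:
https://github.com/WZMIAOMIAO/deep-learning-for-image-processing
The main goal is to deploy the resnet50_fpn detection model from that repository in C++.
The export method used is torch.jit.trace.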

In predict.py, first locate where the model is created and loaded (the weights have already been trained and saved). The code is:

    model = create_model(num_classes=3)
    # load train weights
    train_weights = "./save_weights/resNetFpn-model-14.pth"
    assert os.path.exists(train_weights), "{} file does not exist.".format(train_weights)
    model.load_state_dict(torch.load(train_weights, map_location=device)["model"])
    model.to(device)

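With a small modification (replacing the class JSON file with my_classes.json) the model loads. The next step is the trace method. Tracing needs an example image as input: it runs one forward pass with that input, records the operations that are executed, and saves that recording.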

Image preprocessing: debugging to find the type of image the model takes as input

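Setting a breakpoint after the image is opened and inspecting the original image shows that it is single-channel, with size (2048, 2448).
At this point the image is a PIL Image. It has to be converted to a tensor before it can be fed to the network, and the shape of that input tensor has to be noted: the input on the C++ side must have the same shape for prediction to work correctly.
The main preprocessing code is: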

    data_transform = transforms.Compose([transforms.ToTensor()])
    img = data_transform(original_img)
    img = torch.unsqueeze(img, dim=0)

These three lines complete the preprocessing: they convert the PIL Image to a tensor and then add a batch dimension to img.
The final img has shape (1, 3, 2048, 2448) and dtype float32.
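As a rough sketch of what these steps do to the data, the snippet below mimics ToTensor with plain torch operations. The random uint8 tensor is a hypothetical stand-in for the real image (and much smaller), so only the layout change, the scaling, and the added batch dimension carry over.

```python
import torch

# Hypothetical stand-in for a decoded RGB image in height x width x channel layout.
height, width = 4, 6
pixels = torch.randint(0, 256, (height, width, 3), dtype=torch.uint8)

# Roughly what transforms.ToTensor() does for 8-bit images:
# move channels first and scale the values to the range [0, 1].
img = pixels.permute(2, 0, 1).to(torch.float32) / 255.0

# Add the batch dimension, as torch.unsqueeze(img, dim=0) does in the post.
img = torch.unsqueeze(img, dim=0)

print(img.shape, img.dtype)  # torch.Size([1, 3, 4, 6]) torch.float32
```

The shape and dtype printed here are the two things that have to match on the C++ side.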
3. Getting the model's output type

    model.eval()  # switch to evaluation mode
    with torch.no_grad():
        # init
        img_height, img_width = img.shape[-2:]
        init_img = torch.zeros((1, 3, img_height, img_width), device=device)
        output = model(init_img)

        # device = torch.device('cuda:0')
        img=img.to(device)
        outputs = model(img)
        object = outputs[0]

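Debugging shows that the model output, outputs, is a list holding one dict per input image. Each dict has three entries:
boxes: the bounding boxes of the detections,
labels: the class labels,
scores: the confidence scores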

4. Final step

Define the JIT model:

    trace_script_module = torch.jit.trace(model, img)
    trace_script_module.eval()
    output = trace_script_module(img)
    # print(type(output), output[0, :10], output.shape)
    trace_script_module.save("./save_weights/resNetFpn-model-14_jit.pt")

In theory that should be it. Is it really that simple? Of course not: there are pitfalls.

Pitfall 1: trace reports that the model's return value is not supported (unsolved)

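Running it raises this error:

    RuntimeError: Only tensors, lists, tuples of tensors, or dictionary of tensors can be output from traced functions

The message says trace only works when the model returns tensors, lists or tuples of tensors, or a dict of tensors.
Isn't the model output one of those types?
The earlier debugging showed that the output is a list, so why does it still fail? To be revisited; I am stuck for now. 2022.2.9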

Note: I later tried forcibly changing the model's return value to a tensor, but then ran into an nms mismatch problem similar to Pitfall 3.
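One common way around this kind of error is to wrap the detector so that forward returns a flat tuple of tensors instead of a list of dicts. The sketch below uses a hypothetical ToyDetector in place of the real Faster R-CNN and a hypothetical TupleOutput wrapper. It only shows the output reshaping and does not address other tracing problems such as data-dependent control flow or the nms op.

```python
from typing import Dict, List, Tuple

import torch
from torch import nn


class ToyDetector(nn.Module):
    """Hypothetical stand-in that returns detection-style output: one dict per image."""

    def forward(self, images: torch.Tensor) -> List[Dict[str, torch.Tensor]]:
        score = images.mean().reshape(1)                                  # shape (1,)
        boxes = torch.stack([score, score, score + 1, score + 1], dim=1)  # shape (1, 4)
        labels = torch.ones(1, dtype=torch.int64)                         # shape (1,)
        return [{"boxes": boxes, "labels": labels, "scores": score}]


class TupleOutput(nn.Module):
    """Hypothetical wrapper: unpack the first image's dict into a tuple of tensors."""

    def __init__(self, detector: nn.Module):
        super().__init__()
        self.detector = detector

    def forward(self, images: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        detection = self.detector(images)[0]
        return detection["boxes"], detection["labels"], detection["scores"]


wrapped = TupleOutput(ToyDetector()).eval()
example = torch.rand(1, 3, 32, 32)
traced = torch.jit.trace(wrapped, example)

boxes, labels, scores = traced(example)
print(boxes.shape, labels.shape, scores.shape)
```

Returning a tuple also keeps the C++ side simple, since the result can be read element by element rather than through nested containers.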

Pitfall 2: torch.jit.script is more permissive than torch.jit.trace (unsolved, worked around)

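After a day of searching (Baidu), I learned that the model can also be exported with torch.jit.script. As I understand it, this method is less strict than trace. I also learned along the way that trace cannot handle data-dependent operations such as if statements. So I switched to script for saving the model, and it actually saved successfully. The saving code is: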

    script_model = torch.jit.script(model)
    script_model.save(".xxx.pt")

Loaded in the VS project, it even compiles!
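The difference between the two export methods on an if statement can be seen in a few lines. This is a minimal sketch with a hypothetical function, double_or_zero, not part of the detection model: trace records only the branch taken for the example input, while script compiles both branches.

```python
import warnings

import torch


def double_or_zero(x: torch.Tensor) -> torch.Tensor:
    # Which branch runs depends on the values in x, not just on its shape.
    if x.sum() > 0:
        return x * 2
    return torch.zeros_like(x)


positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # trace emits a TracerWarning about converting a tensor to a Python bool.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(double_or_zero, positive)

scripted = torch.jit.script(double_or_zero)

print(traced(negative))    # tensor([-2., -2., -2.])  the "positive" branch was frozen in
print(scripted(negative))  # tensor([0., 0., 0.])     the if statement is preserved
```

A detection model contains many such value-dependent branches, which is one reason script tends to be the safer choice for it.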

Pitfall 3: the torchvision::nms op cannot be found (unsolved)

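But it still fails at run time: the debugger caught an exception of type torch::jit::ErrorReport.
Wrapping the call in try/catch, the exception message can be printed like this: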

    catch (std::exception& e)
    {
        std::cout << e.what() << std::endl;
    }

The exception message is:

    Unknown builtin op: torchvision::nms.
    Could not find any similar ops to torchvision::nms. This op may not exist or may not be currently supported in TorchScript.
    :
    File "D:\xxxx\boxes.py", line 35
    by NMS, sorted in decreasing order of scores
    """
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
    ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    Serialized File "code/torch/network_files/boxes.py", line 70
    scores: Tensor,
    iou_threshold: float) -> Tensor:
    _24 = ops.torchvision.nms(boxes, scores, iou_threshold)
    ~~~~~~~~~~~~~~~~~~~ <--- H

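It seems the torchvision nms function cannot be saved along with the model: at load time no matching torchvision::nms op is found. This op is registered by torchvision's own C++ library rather than by libtorch, which would explain why a program that links only libtorch does not know it.
Workaround attempted: replace the nms function so that the torchvision op is not used. After some searching I used TeddyZhang's NMS implementation, which can be found here: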

https://zhuanlan.zhihu.com/p/54709759

After splicing that nms function into the RPN of Faster R-CNN (a real Frankenstein job), saving with script produced the following main error:

    RuntimeError:
    aten::append.t(t self, t(c -> *) el) -> (t):
    Could not match type number to t in argument 'el': Type variable 't' previously matched to type Tensor is matched to type number.
    'mynms' is being compiled since it was called from 'batched_nms'
    'batched_nms' is being compiled since it was called from 'RegionProposalNetwork.filter_proposals'
    'RegionProposalNetwork.filter_proposals' is being compiled since it was called from 'RegionProposalNetwork.forward'

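At first the message made me think a syntax problem was preventing compilation. But after removing the torch.jit.script saving code everything runs normally, and the two nms versions return the same results. So I think the problem is the model itself: it is too complex and mixes too many things internally to be exported cleanly.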
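Read literally, "Type variable 't' previously matched to type Tensor is matched to type number" points at a list that TorchScript typed as List[Tensor] receiving a number through append. TorchScript assumes an unannotated empty list holds tensors, so a plausible cause (an assumption, since the mynms source is not shown here) is a line such as keep = [] followed by keep.append(index). The sketch below is not TeddyZhang's code: greedy_nms is a hypothetical minimal NMS written only to show the List[int] annotation that lets such a function compile under torch.jit.script.

```python
from typing import List

import torch


@torch.jit.script
def greedy_nms(boxes: torch.Tensor, scores: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # Annotate the list: an unannotated [] is typed as List[Tensor] in TorchScript,
    # and appending an int to it fails to compile.
    keep: List[int] = []
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = torch.argsort(scores, descending=True)
    while order.numel() > 0:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        x1 = torch.maximum(boxes[rest, 0], boxes[i, 0])
        y1 = torch.maximum(boxes[rest, 1], boxes[i, 1])
        x2 = torch.minimum(boxes[rest, 2], boxes[i, 2])
        y2 = torch.minimum(boxes[rest, 3], boxes[i, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        iou = inter / (areas[rest] + areas[i] - inter)
        # Keep only the boxes that do not overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return torch.tensor(keep, dtype=torch.int64)


# Boxes are (x1, y1, x2, y2); the second box overlaps the first heavily.
boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],
                      [1.0, 1.0, 11.0, 11.0],
                      [20.0, 20.0, 30.0, 30.0]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores, 0.5))  # tensor([0, 2])
```

Plain Python runs the unannotated version without complaint because lists are untyped there, which would match the observation that the code works once the script export is removed.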

Pitfall 4: libtorch reports that the model's return value is not a Tensor (unsolved)

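As the heading suggests, I did manage to export and load a model! The full network has many parts, and I could not get past the nms in the RPN, so I decided to export only the backbone first and test whether it predicts correctly. It exports fine with script and loads, but after adding the necessary code it still fails!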

    torch::NoGradGuard no_grad;
    transformjitscript_module.to(device_type);
    backbonejitscript_module.to(device_type);
    // Tensor::to returns a new tensor, so the result has to be assigned back
    tensorimage = tensorimage.to(device_type);
    transformjitscript_module.eval();
    backbonejitscript_module.eval();

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(tensorimage);
    try {
        // toTensor() throws if forward() returns anything other than a Tensor
        auto outputs = transformjitscript_module.forward(inputs).toTensor();
    }
    catch (std::exception& e)
    {
        std::cout << e.what() << std::endl;
    }

Error message:

    Expected Tensor but got GenericDict

What now? I had already confirmed in the PyTorch environment that the output is a tensor, so this error is baffling. I am at my wits' end!
2022.02.11 00:25
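The message says that the value returned by forward is a dict rather than a single tensor. One way to check this from Python is to load the saved TorchScript file and inspect its output, instead of calling the original model. The sketch below uses a hypothetical ToyBackbone that returns a Dict[str, Tensor], loosely modelled on an FPN-style backbone that returns several named feature maps (an assumption about the real model). The in-memory buffer stands in for the .pt file.

```python
import io
from typing import Dict

import torch
from torch import nn


class ToyBackbone(nn.Module):
    """Hypothetical backbone that returns several named feature maps."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x: torch.Tensor) -> Dict[str, torch.Tensor]:
        feat = self.conv(x)
        return {"0": feat, "pool": self.pool(feat)}


scripted = torch.jit.script(ToyBackbone().eval())

# Save and reload, as the C++ program would do with the .pt file.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

with torch.no_grad():
    out = loaded(torch.rand(1, 3, 32, 32))

print(type(out).__name__)
for name, feature in out.items():
    print(name, tuple(feature.shape))
```

If the reloaded module returns a dict here, the C++ side has to read it as one too: the IValue accessor that matches is toGenericDict() rather than toTensor(), with each entry's value then converted with toTensor().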

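Latest progress: I have successfully loaded part of the model and obtained prediction results, but the full model still cannot be loaded. After studying other people's tutorials and the official libtorch documentation, I found that much of what I wrote and understood above is wrong. I will correct it later.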

posted @ 2022-02-09 16:44 LV426
