C++ Torch: Loading Data from OpenCV's GPU Mat
In vision algorithms, GPU-accelerated image processing is a very common technique. In deep learning, for example, Tensor.to("cuda") is routinely used to move models and images onto the GPU for computation. At deployment time, however, PyTorch's C++ distribution LibTorch does not ship the image-preprocessing modules such as torchvision, so OpenCV is used in their place.
The most common pipeline is: decode the image and preprocess it with OpenCV on the CPU, build a torch::Tensor from the resulting cv::Mat, upload the Tensor to the GPU with Tensor.to(kCUDA), and run inference.
The advantage of that flow is that it is logically simple and easy to write. In latency-sensitive scenarios, however, the preprocessing can also be moved onto the GPU, and LibTorch can then fetch the data pointer directly from GPU memory, which shortens the total compute time dramatically. The flow becomes: decode on the CPU, upload the cv::Mat to the GPU, preprocess with OpenCV's CUDA modules, build a torch::Tensor directly from the GpuMat's device pointer, and run inference.
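To make the idea concrete before diving into the benchmark, here is a distilled sketch of the GPU-side flow. The file names are placeholders and the mean/std normalization is omitted for brevity; the full code below fills in those details.

#include <opencv2/cudawarping.hpp>
#include <opencv2/opencv.hpp>
#include <torch/script.h>

int main() {
    cv::Mat image = cv::imread("input.png", cv::IMREAD_COLOR);
    cv::cuda::GpuMat raw, resized, f;
    raw.upload(image);                                   // 1. copy the CPU Mat to the GPU
    cv::cuda::resize(raw, resized, cv::Size(224, 224));  // 2. preprocess on the GPU
    resized.convertTo(f, CV_32FC3, 1.0 / 255.0);         //    uint8 -> float in [0, 1]
    // 3. wrap the device pointer directly; the strides account for the padded row pitch
    auto t = torch::from_blob(
                 f.data, {224, 224, 3},
                 {static_cast<int64_t>(f.step / sizeof(float)), 3, 1},
                 torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA))
                 .clone().permute({2, 0, 1}).contiguous().unsqueeze(0);
    torch::jit::Module model = torch::jit::load("model.ts");
    model.forward({t});                                  // 4. inference, entirely on the GPU
    return 0;
}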
Let me give the measured results up front: in my tests, moving preprocessing to the GPU shortened processing time by nearly 4 ms compared with preprocessing on the CPU.
CPU Test Time: 5.12 ms
  Preprocess Time: 0.49 ms
  Convert Time: 3.10 ms
  Upload Time: 0.08 ms (Tensor.to(kCUDA))
  Backend Time: 1.45 ms

GPU Test Time: 1.57 ms
  Upload Time: 0.31 ms (Mat.upload())
  Preprocess Time: 0.10 ms
  Convert Time: 0.15 ms
  Backend Time: 1.01 ms
Comparing the two, the time OpenCV spends preprocessing the image on the CPU is almost identical to the combined time of the two GPU-side steps (uploading plus preprocessing). The real difference lies in the Convert step, which builds a Tensor from the Mat, reorders the channels, and converts to half precision: because LibTorch can obtain the pointer directly on the GPU device and operate there, this step runs far faster than its CPU counterpart.
The test code is given below:
#include <chrono>
#include <iomanip>
#include <iostream>
#include <vector>

#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/opencv.hpp>
#include <torch/cuda.h>
#include <torch/script.h>
#include <torch_tensorrt/torch_tensorrt.h>
int main() {
    cv::Mat image = cv::imread("000.png", cv::IMREAD_COLOR_RGB);
    torch::jit::Module backbone = torch::jit::load("bottle-wideresnet50.ts");
    constexpr float k_mean[3] = {0.485f, 0.456f, 0.406f};
    constexpr float k_std[3] = {0.229f, 0.224f, 0.225f};

    // ===== Warm Up: run the whole pipeline once so that CUDA context and
    // kernel initialization costs do not pollute the measurements =====
    {
        cv::cuda::GpuMat tmp;
        tmp.upload(image);
        cv::cuda::resize(tmp, tmp, cv::Size(232, 232));
        int offset = (232 - 224) / 2;
        cv::Rect roi(offset, offset, 224, 224);
        cv::cuda::GpuMat cropped_gpu = tmp(roi);
        std::vector<cv::cuda::GpuMat> channels(3);
        cv::cuda::split(cropped_gpu, channels);
        for (int c = 0; c < 3; ++c) {
            double scale = 1.0 / (255.0 * k_std[c]);
            double shift = -static_cast<double>(k_mean[c]) / k_std[c];
            channels[c].convertTo(channels[c], CV_32F, scale, shift);
        }
        cv::cuda::GpuMat normalized;
        cv::cuda::merge(channels, normalized);
        auto t = torch::from_blob(
            normalized.data,
            {224, 224, 3},
            {static_cast<int64_t>(normalized.step / sizeof(float)), 3, 1},
            torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA)
        ).clone();
        t = t.permute({2, 0, 1}).contiguous().unsqueeze(0);
        t = t.to(torch::kHalf);
        backbone.forward({t});
        torch::cuda::synchronize();
    }
    // ===== Preprocessing with OpenCV on the CPU =====
    auto start_cpu = std::chrono::steady_clock::now();
    cv::Mat resized;
    cv::resize(image, resized, cv::Size(232, 232));
    int offset = (232 - 224) / 2;
    cv::Rect roi(offset, offset, 224, 224);
    cv::Mat cropped = resized(roi).clone();
    std::vector<cv::Mat> channels(3);
    cv::split(cropped, channels);
    for (int c = 0; c < 3; ++c) {
        double scale = 1.0 / (255.0 * k_std[c]);
        double shift = -static_cast<double>(k_mean[c]) / k_std[c];
        channels[c].convertTo(channels[c], CV_32F, scale, shift);
    }
    cv::Mat normalized;
    cv::merge(channels, normalized);
    torch::cuda::synchronize();
    auto end_cpu_preprocess = std::chrono::steady_clock::now();
    auto duration_cpu_preprocess =
        std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_preprocess - start_cpu);

    auto start_cpu_convert = std::chrono::steady_clock::now();
    // The CPU Mat is tightly packed, so no explicit strides are needed here
    torch::Tensor tensor = torch::from_blob(
        normalized.data,
        {224, 224, 3},
        torch::TensorOptions().dtype(torch::kF32).device(torch::kCPU)
    ).clone();
    tensor = tensor.permute({2, 0, 1}).contiguous().unsqueeze(0);
    torch::cuda::synchronize();
    auto end_cpu_convert = std::chrono::steady_clock::now();
    auto duration_cpu_convert =
        std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_convert - start_cpu_convert);

    auto start_cpu_upload = std::chrono::steady_clock::now();
    tensor = tensor.to(torch::TensorOptions().dtype(torch::kF16).device(torch::kCUDA));
    torch::cuda::synchronize();
    auto end_cpu_upload = std::chrono::steady_clock::now();
    auto duration_cpu_upload =
        std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_upload - start_cpu_upload);

    auto start_cpu_backend = std::chrono::steady_clock::now();
    auto features = backbone.forward({tensor}).toTuple()->elements();
    for (const auto &feat : features) {
        auto feat_tensor = feat.toTensor();
    }
    torch::cuda::synchronize();
    auto end_cpu_backend = std::chrono::steady_clock::now();
    auto duration_cpu_backend =
        std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_backend - start_cpu_backend);
    auto duration_cpu = std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_backend - start_cpu);
    // ===== Preprocessing with OpenCV on the GPU =====
    auto start_gpu = std::chrono::steady_clock::now();
    cv::cuda::GpuMat image_gpu;
    image_gpu.upload(image);
    torch::cuda::synchronize();
    auto end_gpu_upload = std::chrono::steady_clock::now();
    auto duration_gpu_upload =
        std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_upload - start_gpu);

    auto start_gpu_preprocess = std::chrono::steady_clock::now();
    cv::cuda::GpuMat resized_gpu;
    cv::cuda::resize(image_gpu, resized_gpu, cv::Size(232, 232));
    cv::cuda::GpuMat cropped_gpu = resized_gpu(roi);
    std::vector<cv::cuda::GpuMat> channels_gpu(3);
    cv::cuda::split(cropped_gpu, channels_gpu);
    for (int c = 0; c < 3; ++c) {
        double scale = 1.0 / (255.0 * k_std[c]);
        double shift = -static_cast<double>(k_mean[c]) / k_std[c];
        channels_gpu[c].convertTo(channels_gpu[c], CV_32F, scale, shift);
    }
    cv::cuda::GpuMat normalized_gpu;
    cv::cuda::merge(channels_gpu, normalized_gpu);
    torch::cuda::synchronize();
    auto end_gpu_preprocess = std::chrono::steady_clock::now();
    auto duration_gpu_opencv =
        std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_preprocess - start_gpu_preprocess);

    auto start_gpu_convert = std::chrono::steady_clock::now();
    // Wrap the device memory directly; the stride must come from the GpuMat
    // itself (normalized_gpu.step), since GPU rows are padded
    torch::Tensor tensor_gpu = torch::from_blob(
        normalized_gpu.data,
        {224, 224, 3},
        {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1},
        torch::TensorOptions().dtype(torch::kF32).device(torch::kCUDA)
    ).clone();
    tensor_gpu = tensor_gpu.permute({2, 0, 1}).contiguous().unsqueeze(0);
    tensor_gpu = tensor_gpu.to(torch::kHalf);
    torch::cuda::synchronize();
    auto end_gpu_convert = std::chrono::steady_clock::now();
    auto duration_gpu_convert =
        std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_convert - start_gpu_convert);

    auto start_gpu_backend = std::chrono::steady_clock::now();
    auto features2 = backbone.forward({tensor_gpu}).toTuple()->elements();
    for (const auto &feat : features2) {
        auto feat_tensor = feat.toTensor();
    }
    torch::cuda::synchronize();
    auto end_gpu_backend = std::chrono::steady_clock::now();
    auto duration_gpu_backend =
        std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_backend - start_gpu_backend);
    auto duration_gpu = std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_backend - start_gpu);
std::cout << "CPU Test Time: " << std::fixed << std::setprecision(2) << duration_cpu.count() / 1000.0 << " ms"
<< std::endl;
std::cout << " Preprocess Time: " << std::fixed << std::setprecision(2) << duration_cpu_preprocess.count() / 1000.0
<< " ms" << std::endl;
std::cout << " Convert Time: " << std::fixed << std::setprecision(2) << duration_cpu_convert.count() / 1000.0
<< " ms" << std::endl;
std::cout << " Upload Time: " << std::fixed << std::setprecision(2) << duration_cpu_upload.count() / 1000.0 << " ms"
<< std::endl;
std::cout << " Backend Time: " << std::fixed << std::setprecision(2) << duration_cpu_backend.count() / 1000.0
<< " ms" << std::endl;
std::cout << std::endl;
std::cout << "GPU Test Time: " << std::fixed << std::setprecision(2) << duration_gpu.count() / 1000.0 << " ms"
<< std::endl;
std::cout << " Upload Time: " << std::fixed << std::setprecision(2) << duration_gpu_upload.count() / 1000.0 << " ms"
<< std::endl;
std::cout << " Preprocess Time: " << std::fixed << std::setprecision(2) << duration_gpu_opencv.count() / 1000.0
<< " ms" << std::endl;
std::cout << " Convert Time: " << std::fixed << std::setprecision(2) << duration_gpu_convert.count() / 1000.0
<< " ms" << std::endl;
std::cout << " Backend Time: " << std::fixed << std::setprecision(2) << duration_gpu_backend.count() / 1000.0
<< " ms" << std::endl;
}
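One detail of the timing code worth calling out: CUDA work is launched asynchronously, so every measured region ends with torch::cuda::synchronize() before the clock is read; without it, the timer would capture only the kernel-launch overhead. A minimal self-contained illustration of that pattern (the matmul is just a stand-in workload):

#include <chrono>
#include <iostream>
#include <torch/torch.h>

int main() {
    auto a = torch::randn({2048, 2048}, torch::kCUDA);
    auto t0 = std::chrono::steady_clock::now();
    auto b = torch::matmul(a, a);  // enqueued on the CUDA stream; returns immediately
    torch::cuda::synchronize();    // block until the GPU has actually finished
    auto t1 = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us" << std::endl;
    return 0;
}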
The key part of this code is where LibTorch obtains the data from the cv::cuda::GpuMat:
torch::Tensor tensor_gpu = torch::from_blob(
    normalized_gpu.data, {224, 224, 3},
    {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1},
    torch::TensorOptions().dtype(torch::kF32).device(torch::kCUDA)
).clone();
The first argument of torch::from_blob() is the data pointer. The second argument is the shape, i.e. how the data is laid out (OpenCV stores images in HWC order by default). The fourth argument specifies the data type and the device the data resides on. Finally, because torch::from_blob() only creates a view over the existing memory rather than a deep copy of the data, the torch::Tensor's data pointer becomes dangling once the cv::Mat is destroyed; so if the torch::Tensor is to be returned from a function, it must be cloned first. The crucial argument is the third one, which specifies the strides used to walk the data. To see why, first look at how the CPU version builds its Tensor:
torch::Tensor tensor = torch::from_blob(
    normalized.data,
    {224, 224, 3},
    torch::TensorOptions().dtype(torch::kF32).device(torch::kCPU)
).clone();
Apart from the device option (torch::kCPU instead of torch::kCUDA) and the stride argument, which is simply absent, the CPU version is identical. The CPU version can do without explicit strides while the GPU version cannot, and the root cause is memory contiguity.
GPU memory accesses have alignment requirements: the starting address of each row must be aligned to some boundary (for example 256 or 512 bytes) so the GPU can read at full efficiency. When cv::cuda::GpuMat allocates memory on the GPU, it therefore pads each row, and the actual memory layout is not tightly packed. See the official OpenCV documentation for details. The memory view is roughly as follows:
|R G B ... R G B | padding | ← row 0
|R G B ... R G B | padding | ← row 1
So in the stride argument {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1}, which corresponds to the HWC layout: the first value is how many floats to step over to reach the next row; the second is how many floats to step over to reach the next column, which is 3 because each pixel holds three interleaved RGB values; and the third is how many floats to step over to reach the next channel, which is 1. Here .step is the actual number of bytes each row occupies, including padding, also called the row pitch (or row stride).
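To make the stride handling reusable, here is a minimal sketch of a helper (the function name is mine, not part of the original code) that wraps any CV_32FC3 GpuMat as an owning NCHW CUDA tensor, deriving the sizes and strides from the Mat itself and cloning so the result outlives the GpuMat:

#include <opencv2/core/cuda.hpp>
#include <torch/torch.h>

// Hypothetical helper: wrap a CV_32FC3 GpuMat as an owning 1x3xHxW CUDA tensor.
torch::Tensor gpu_mat_to_tensor(const cv::cuda::GpuMat &m) {
    CV_Assert(m.type() == CV_32FC3);
    // step is the padded row pitch in bytes, so it may exceed cols * 3 * sizeof(float)
    auto row_stride = static_cast<int64_t>(m.step / sizeof(float));
    auto view = torch::from_blob(
        m.data,
        {m.rows, m.cols, 3},  // HWC shape
        {row_stride, 3, 1},   // strides counted in elements, not bytes
        torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA));
    // clone() deep-copies, so the tensor stays valid after m is released
    return view.clone().permute({2, 0, 1}).contiguous().unsqueeze(0);
}

For other channel counts, the hard-coded 3 can be replaced with m.channels().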
By the same reasoning, if the memory that torch::from_blob reads were laid out as CHW, the shape (second argument) would change to {3, 224, 224} and the strides (third argument) to {H*W, W, 1} (assuming tightly packed rows with no padding), because CHW stores the entire C0 channel first, then C1, then C2. Its memory view is roughly as follows:
[R0 R1 R2 ... R(H*W-1) | G0 G1 G2 ... G(H*W-1) | B0 B1 B2 ... B(H*W-1)]
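As a sketch of that case (the buffer below is a hypothetical cudaMalloc allocation, e.g. the output of a custom CUDA kernel, and a tightly packed layout with no row padding is assumed):

#include <cuda_runtime.h>
#include <torch/torch.h>

int main() {
    const int64_t C = 3, H = 224, W = 224;
    float *chw = nullptr;
    cudaMalloc(&chw, C * H * W * sizeof(float));  // tightly packed CHW buffer
    // Strides in elements: next channel is H*W away, next row is W, next column is 1
    auto t = torch::from_blob(
        chw, {C, H, W}, {H * W, W, 1},
        torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA));
    auto input = t.clone().unsqueeze(0);  // deep-copy so we own the data, add batch dim
    cudaFree(chw);
    return 0;
}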
