Loading Data into C++ Torch from an OpenCV GPU Mat

In vision algorithms, GPU-accelerated image processing is extremely common. In deep learning, for example, Tensor.to("cuda") is routinely used to move the model and images onto the GPU for computation. At deployment time, however, LibTorch (PyTorch's C++ distribution) does not ship image-preprocessing modules such as torchvision, so OpenCV is used instead.

The most common processing pipeline looks like this:

flowchart TD
    subgraph CPU
        A[Load image]
        A --> B[OpenCV preprocessing]
        B --> C[Mat to Tensor]
    end
    subgraph GPU
        E[Model inference]
        E --> F[Postprocessing]
    end
    C -->|"Tensor.to(torch::kCUDA)"| D[Transfer to GPU] --> E
    F --> G[Output results]

This pipeline's advantage is that the logic is simple and easy to write. In latency-sensitive scenarios, however, the preprocessing can also be moved onto the GPU, with LibTorch then reading the data pointer directly from GPU memory, which shortens the processing time considerably. That pipeline looks like this:

flowchart TD
    subgraph CPU
        A[Load image]
    end
    subgraph GPU
        B[OpenCV preprocessing]
        B --> C[Mat to Tensor]
        C --> D[Model inference]
        D --> E[Postprocessing]
    end
    A -->|"Mat.upload()"| B
    E --> G[Output results]

Here are the test results first. In the author's tests, moving preprocessing onto the GPU shortened the processing time by nearly 4 ms compared with preprocessing on the CPU.

CPU Test Time: 5.12 ms
  Preprocess Time:  0.49 ms
  Convert Time:     3.10 ms
  Upload Time:      0.08 ms (Tensor.to(kCUDA))
  Backend Time:     1.45 ms

GPU Test Time: 1.57 ms
  Upload Time:      0.31 ms (Mat.upload())
  Preprocess Time:  0.10 ms
  Convert Time:     0.15 ms
  Backend Time:     1.01 ms

From these numbers, OpenCV's "preprocess the image on the CPU" step costs roughly the same as "upload to the GPU and preprocess there" (two steps combined). But the subsequent step of building a Tensor from the Mat, reordering the channels, and converting to half precision (Convert Time) is far faster on the GPU, because LibTorch can operate directly on the device pointer.

The test code is given below.

#include <chrono>
#include <iomanip>
#include <iostream>
#include <vector>

#include <opencv2/cudaarithm.hpp>
#include <opencv2/cudawarping.hpp>
#include <opencv2/opencv.hpp>
#include <torch/cuda.h>
#include <torch/script.h>
#include <torch_tensorrt/torch_tensorrt.h>

int main() {
  cv::Mat image = cv::imread("000.png", cv::IMREAD_COLOR_RGB);
  torch::jit::Module backbone = torch::jit::load("bottle-wideresnet50.ts");

  constexpr float k_mean[3] = {0.485f, 0.456f, 0.406f};
  constexpr float k_std[3] = {0.229f, 0.224f, 0.225f};

  // ===== Warm Up =====
  {
    cv::cuda::GpuMat tmp;
    tmp.upload(image);
    cv::cuda::resize(tmp, tmp, cv::Size(232, 232));

    int offset = (232 - 224) / 2;
    cv::Rect roi(offset, offset, 224, 224);
    cv::cuda::GpuMat cropped_gpu = tmp(roi);
    std::vector<cv::cuda::GpuMat> channels(3);
    cv::cuda::split(cropped_gpu, channels);

    for (int c = 0; c < 3; ++c) {
      double scale = 1.0 / (255.0 * k_std[c]);
      double shift = -static_cast<double>(k_mean[c]) / k_std[c];
      channels[c].convertTo(channels[c], CV_32F, scale, shift);
    }

    cv::cuda::GpuMat normalized;
    cv::cuda::merge(channels, normalized);

    auto t = torch::from_blob(
        normalized.data,
        {224, 224, 3},
        {static_cast<int64_t>(normalized.step / sizeof(float)), 3, 1},
        torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA)
    ).clone();
    t = t.permute({2, 0, 1}).contiguous().unsqueeze(0);

    t = t.to(torch::kHalf);
    backbone.forward({t});
    torch::cuda::synchronize();
  }

  // ===== OpenCV on the CPU =====
  auto start_cpu = std::chrono::steady_clock::now();

  cv::Mat resized;
  cv::resize(image, resized, cv::Size(232, 232));
  int offset = (232 - 224) / 2;
  cv::Rect roi(offset, offset, 224, 224);
  cv::Mat cropped = resized(roi).clone();
  std::vector<cv::Mat> channels(3);
  cv::split(cropped, channels);

  for (int c = 0; c < 3; ++c) {
    double scale = 1.0 / (255.0 * k_std[c]);
    double shift = -static_cast<double>(k_mean[c]) / k_std[c];
    channels[c].convertTo(channels[c], CV_32F, scale, shift);
  }

  cv::Mat normalized;
  cv::merge(channels, normalized);

  torch::cuda::synchronize();
  auto end_cpu_preprocess = std::chrono::steady_clock::now();
  auto duration_cpu_preprocess = std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_preprocess - start_cpu);
  auto start_cpu_convert = std::chrono::steady_clock::now();

  torch::Tensor tensor = torch::from_blob(
      normalized.data,
      {224, 224, 3},
      torch::TensorOptions().dtype(torch::kF32).device(torch::kCPU)
  ).clone();
  tensor = tensor.permute({2, 0, 1}).contiguous().unsqueeze(0);

  torch::cuda::synchronize();
  auto end_cpu_convert = std::chrono::steady_clock::now();
  auto duration_cpu_convert =
      std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_convert - start_cpu_convert);
  auto start_cpu_upload = std::chrono::steady_clock::now();
  tensor = tensor.to(torch::TensorOptions().dtype(torch::kF16).device(torch::kCUDA));

  torch::cuda::synchronize();
  auto end_cpu_upload = std::chrono::steady_clock::now();
  auto duration_cpu_upload = std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_upload - start_cpu_upload);
  auto start_cpu_backend = std::chrono::steady_clock::now();

  auto features = backbone.forward({tensor}).toTuple()->elements();
  for (const auto &feat : features) {
    auto feat_tensor = feat.toTensor();
  }

  torch::cuda::synchronize();
  auto end_cpu_backend = std::chrono::steady_clock::now();
  auto duration_cpu_backend =
      std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_backend - start_cpu_backend);
  auto duration_cpu = std::chrono::duration_cast<std::chrono::microseconds>(end_cpu_backend - start_cpu);

  // ===== OpenCV on the GPU =====
  auto start_gpu = std::chrono::steady_clock::now();

  cv::cuda::GpuMat image_gpu;
  image_gpu.upload(image);

  torch::cuda::synchronize();
  auto end_gpu_upload = std::chrono::steady_clock::now();
  auto duration_gpu_upload = std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_upload - start_gpu);
  auto start_gpu_preprocess = std::chrono::steady_clock::now();

  cv::cuda::GpuMat resized_gpu;
  cv::cuda::resize(image_gpu, resized_gpu, cv::Size(232, 232));
  cv::cuda::GpuMat cropped_gpu = resized_gpu(roi);
  std::vector<cv::cuda::GpuMat> channels_gpu(3);
  cv::cuda::split(cropped_gpu, channels_gpu);

  for (int c = 0; c < 3; ++c) {
    double scale = 1.0 / (255.0 * k_std[c]);
    double shift = -static_cast<double>(k_mean[c]) / k_std[c];
    channels_gpu[c].convertTo(channels_gpu[c], CV_32F, scale, shift);
  }

  cv::cuda::GpuMat normalized_gpu;
  cv::cuda::merge(channels_gpu, normalized_gpu);

  torch::cuda::synchronize();
  auto end_gpu_preprocess = std::chrono::steady_clock::now();
  auto duration_gpu_opencv =
      std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_preprocess - start_gpu_preprocess);
  auto start_gpu_convert = std::chrono::steady_clock::now();

  torch::Tensor tensor_gpu = torch::from_blob(
      normalized_gpu.data,
      {224, 224, 3},
      {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1},
      torch::TensorOptions().dtype(torch::kF32).device(torch::kCUDA)
  ).clone();
  tensor_gpu = tensor_gpu.permute({2, 0, 1}).contiguous().unsqueeze(0);
  tensor_gpu = tensor_gpu.to(torch::kHalf);

  torch::cuda::synchronize();
  auto end_gpu_convert = std::chrono::steady_clock::now();
  auto duration_gpu_convert =
      std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_convert - start_gpu_convert);
  auto start_gpu_backend = std::chrono::steady_clock::now();

  auto features2 = backbone.forward({tensor_gpu}).toTuple()->elements();
  for (const auto &feat : features2) {
    auto feat_tensor = feat.toTensor();
  }

  torch::cuda::synchronize();
  auto end_gpu_backend = std::chrono::steady_clock::now();
  auto duration_gpu_backend =
      std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_backend - start_gpu_backend);
  auto duration_gpu = std::chrono::duration_cast<std::chrono::microseconds>(end_gpu_backend - start_gpu);

  std::cout << "CPU Test Time: " << std::fixed << std::setprecision(2) << duration_cpu.count() / 1000.0 << " ms"
            << std::endl;
  std::cout << "  Preprocess Time: " << std::fixed << std::setprecision(2) << duration_cpu_preprocess.count() / 1000.0
            << " ms" << std::endl;
  std::cout << "  Convert Time: " << std::fixed << std::setprecision(2) << duration_cpu_convert.count() / 1000.0
            << " ms" << std::endl;
  std::cout << "  Upload Time: " << std::fixed << std::setprecision(2) << duration_cpu_upload.count() / 1000.0 << " ms"
            << std::endl;
  std::cout << "  Backend Time: " << std::fixed << std::setprecision(2) << duration_cpu_backend.count() / 1000.0
            << " ms" << std::endl;

  std::cout << std::endl;

  std::cout << "GPU Test Time: " << std::fixed << std::setprecision(2) << duration_gpu.count() / 1000.0 << " ms"
            << std::endl;
  std::cout << "  Upload Time: " << std::fixed << std::setprecision(2) << duration_gpu_upload.count() / 1000.0 << " ms"
            << std::endl;
  std::cout << "  Preprocess Time: " << std::fixed << std::setprecision(2) << duration_gpu_opencv.count() / 1000.0
            << " ms" << std::endl;
  std::cout << "  Convert Time: " << std::fixed << std::setprecision(2) << duration_gpu_convert.count() / 1000.0
            << " ms" << std::endl;
  std::cout << "  Backend Time: " << std::fixed << std::setprecision(2) << duration_gpu_backend.count() / 1000.0
            << " ms" << std::endl;
}

The key part of this code is where LibTorch obtains the data from the cv::cuda::GpuMat:

torch::Tensor tensor_gpu = torch::from_blob(
    normalized_gpu.data, {224, 224, 3},
    {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1},
    torch::TensorOptions().dtype(torch::kF32).device(torch::kCUDA)
).clone();

The first argument of torch::from_blob() is the data pointer; the second is the shape, i.e. how the channels are laid out (OpenCV reads images in HWC order by default); the fourth specifies the data type and the device the data lives on. Note that torch::from_blob() only creates a view (a reference to the pointer), not a deep copy of the data: once the cv::Mat is destroyed, the torch::Tensor's data pointer dangles, so if the torch::Tensor is to be returned from inside a function, it must be copied. The key argument is the third one: the strides used to read the data. Let's first look at how the CPU version builds its tensor.

torch::Tensor tensor = torch::from_blob(
    normalized.data,
    {224, 224, 3},
    torch::TensorOptions().dtype(torch::kF32).device(torch::kCPU)
).clone();

Apart from the fourth argument (torch::kCPU) and the third argument (absent), the CPU version is identical. So the CPU version needs no explicit strides while the GPU version does, and the root cause is memory contiguity.

GPU memory access has alignment requirements: each row's starting address must be aligned to some boundary (e.g. 256 or 512 bytes) for the GPU to read most efficiently. When cv::cuda::GpuMat allocates GPU memory, each row therefore carries alignment padding, so the actual memory layout is not tightly packed. See the official OpenCV documentation for details. The memory roughly looks like this:

|R G B ... R G B | padding |   ← row 0
|R G B ... R G B | padding |   ← row 1

So for the argument {static_cast<int64_t>(normalized_gpu.step / sizeof(float)), 3, 1}, which corresponds to HWC: the first stride is how many floats to step over to reach the next row; the second is how many floats to step over to reach the next column, which is 3 because the image has three RGB channels; the third is how many floats to step over to reach the next channel. Here .step is the actual number of bytes each row occupies, also called the row pitch (or row stride).

By the same reasoning, if the memory layout torch::from_blob reads is CHW, then besides changing the second argument to something like {3, 224, 224}, the third argument should become {H*W, W, 1}, because CHW stores the entire C0 channel first, then C1, then C2. Its memory roughly looks like this:

[R0 R1 R2 ... R(H*W-1) | G0 G1 G2 ... G(H*W-1) | B0 B1 B2 ... B(H*W-1)]
posted @ 2026-03-16 16:10  絵守辛玥