TensorRT Workflow Walkthrough

build_engine:
        trt.Builder(TRT_LOGGER)
              ↓         ↓
          network    config
              ↓         ↓
   populate_network  set_memory_pool_limit
              ↓         ↓
        plan (build_serialized_network)      runtime
                    ↓                           ↓
             engine (deserialize_cuda_engine)
                    ↓
                context         stream
                    ↓             ↓
          do_inference (CPU↔GPU data transfer, GPU inference)

# Python
    import tensorrt as trt

    import common  # helper utilities from the TensorRT Python samples

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(0)
    config = builder.create_builder_config()
    runtime = trt.Runtime(TRT_LOGGER)

    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, common.GiB(1))
    # Populate the network using weights from the PyTorch model.
    populate_network(network, weights)
    # Build a serialized engine (plan), then deserialize it into an engine.
    plan = builder.build_serialized_network(network, config)
    engine = runtime.deserialize_cuda_engine(plan)

    # Allocate pinned host / device buffers and create a CUDA stream.
    inputs, outputs, bindings, stream = common.allocate_buffers(engine)
    context = engine.create_execution_context()

    [output] = common.do_inference(
        context,
        engine=engine,
        bindings=bindings,
        inputs=inputs,
        outputs=outputs,
        stream=stream,
    )

    common.free_buffers(inputs, outputs, stream)
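
The common module above comes from the TensorRT Python samples. As a rough illustration of what its do_inference helper does, here is a minimal sketch assuming pycuda-managed buffers, where each buffer object exposes a pinned host array as .host and a device allocation as .device (these attribute names are assumptions):

    import pycuda.driver as cuda

    def do_inference_sketch(context, bindings, inputs, outputs, stream):
        # Copy input data from pinned host memory to the GPU, asynchronously.
        for inp in inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, stream)
        # Enqueue the inference kernels on the same stream.
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Copy the results back to host memory, asynchronously.
        for out in outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, stream)
        # Block until all queued transfers and kernels have finished.
        stream.synchronize()
        return [out.host for out in outputs]

Newer TensorRT releases prefer execute_async_v3 with per-tensor addresses set via set_tensor_address, but the transfer → execute → transfer → synchronize pattern on a single stream is the same.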

builder: converts the original model (e.g. ONNX) into a high-performance inference graph; it parses the network structure, applies graph optimizations, configures precision policies, and produces the serialized, optimized engine (see the ONNX-parsing sketch after this list);
runtime: loads and deserializes the optimized engine file (the inference graph) and manages GPU/CPU hardware resources (see the save/load sketch after this list);
engine: contains the compiled model structure, the weights, and kernel configurations tuned for the target hardware; essentially the original model (e.g. ONNX) turned into a high-performance inference graph through layer fusion, quantization, and similar optimizations;
context: a runtime instance of an engine; it manages the dynamic state of inference (e.g. dynamic input shapes, intermediate tensors); one engine can create multiple contexts, enabling multi-stream parallel inference;
inputs: the user's input data;
outputs: the model's output after inference;
bindings: the logical indices of the input/output tensors within the engine;
stream: a CUDA stream, i.e. an asynchronous sequence of CUDA operations, used to overlap data transfers with kernel execution so that neither the CPU nor the GPU sits idle, improving throughput.
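
As a concrete example of the builder's role, here is a minimal sketch of building a serialized engine from an ONNX file with trt.OnnxParser (the path model.onnx and the function name are placeholders):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    def build_plan_from_onnx(onnx_path="model.onnx"):
        builder = trt.Builder(TRT_LOGGER)
        network = builder.create_network(0)  # explicit-batch network
        parser = trt.OnnxParser(network, TRT_LOGGER)
        # Parse the ONNX graph into the TensorRT network definition.
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                raise RuntimeError("failed to parse the ONNX model")
        config = builder.create_builder_config()
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
        # Graph optimization, kernel selection, and serialization happen here.
        return builder.build_serialized_network(network, config)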
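
On the runtime side, the serialized plan is just a byte buffer, so it can be written to disk at build time and reloaded later without a builder; a minimal sketch (the file name sample.engine is a placeholder):

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    # Build time: plan is the buffer returned by build_serialized_network().
    with open("sample.engine", "wb") as f:
        f.write(plan)

    # Deployment time: only a Runtime is needed to deserialize the engine.
    runtime = trt.Runtime(TRT_LOGGER)
    with open("sample.engine", "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())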
