Understanding the TensorRT Workflow
build_engine:
trt.Builder(TRT_LOGGER)
     ↓            ↓
  network       config
     ↓            ↓
  plan (build_serialized_network)      runtime (trt.Runtime)
            ↓                              ↓
        engine (deserialize_cuda_engine)
            ↓
        context                        stream
            ↓                              ↓
  do_inference (CPU↔GPU data transfer, GPU inference)
# Python
import tensorrt as trt

import common  # helper module from the TensorRT Python samples

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(0)  # 0 = default flags (explicit batch)
config = builder.create_builder_config()
runtime = trt.Runtime(TRT_LOGGER)

# Cap the scratch GPU memory TensorRT may use while building.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, common.GiB(1))

# Populate the network using weights from the PyTorch model.
# (populate_network and weights are defined elsewhere in the sample.)
populate_network(network, weights)

# Build the serialized engine (the "plan"), then deserialize it.
plan = builder.build_serialized_network(network, config)
engine = runtime.deserialize_cuda_engine(plan)

# Allocate paired host/device buffers for the I/O tensors and a CUDA stream.
inputs, outputs, bindings, stream = common.allocate_buffers(engine)
context = engine.create_execution_context()

# Transfer inputs to the GPU, run inference, and fetch the outputs back.
[output] = common.do_inference(
    context,
    engine=engine,
    bindings=bindings,
    inputs=inputs,
    outputs=outputs,
    stream=stream,
)
common.free_buffers(inputs, outputs, stream)
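Because the plan is plain bytes, a common pattern is to cache it on disk and skip the expensive build step on later runs. A minimal sketch, assuming a hypothetical cache path sample.engine and a build_plan() callable wrapping the build code above:

# Python
import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "sample.engine"  # hypothetical cache location

def load_or_build_engine(build_plan):
    # build_plan: any callable returning the serialized network, e.g. a
    # wrapper around builder.build_serialized_network(network, config).
    runtime = trt.Runtime(TRT_LOGGER)
    if os.path.exists(ENGINE_PATH):
        with open(ENGINE_PATH, "rb") as f:
            plan = f.read()
    else:
        plan = build_plan()
        with open(ENGINE_PATH, "wb") as f:
            f.write(plan)  # IHostMemory supports the buffer protocol
    return runtime.deserialize_cuda_engine(plan)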
builder: converts the original model (e.g., ONNX) into a high-performance inference graph: it parses the network structure, applies graph optimizations, configures the precision strategy, and produces a serialized, optimized engine (see the ONNX-based sketch after this list);
runtime: loads and deserializes the optimized engine file (the inference graph) and manages GPU/CPU hardware resources;
engine: contains the compiled model structure, the weight parameters, and compute-kernel configurations tuned for the target hardware. Essentially, it is the original model (e.g., ONNX) turned into a high-performance inference graph through layer fusion, quantization, and similar optimizations;
context: the runtime instance of an engine; it manages the dynamic state of inference (e.g., dynamic input shapes, intermediate tensors). One engine can create multiple contexts, supporting multi-stream parallel inference (see the dynamic-shape sketch below);
inputs: the user-supplied input data;
outputs: the model's outputs after a run;
bindings: the logical indices of the input/output tensors within the engine;
stream: a CUDA stream, i.e. an asynchronous execution sequence of CUDA operations. It manages the concurrency between data transfers and kernel execution, keeping the CPU and GPU from idling and improving throughput (see the async-inference sketch below);
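To make the builder's role concrete: in practice the network is usually not populated by hand but parsed from an ONNX file. A minimal sketch, assuming a hypothetical model.onnx path and an optional FP16 flag (neither is part of the sample above):

# Python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_plan_from_onnx(onnx_path="model.onnx"):  # hypothetical path
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(0)
    parser = trt.OnnxParser(network, TRT_LOGGER)
    # Parse the ONNX graph directly into the network definition.
    if not parser.parse_from_file(onnx_path):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse the ONNX model")
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # precision strategy
    return builder.build_serialized_network(network, config)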
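For the context's dynamic-shape role: a dynamic input must be declared with an optimization profile at build time, and each context must then be given a concrete shape before execution. A sketch assuming a single input tensor named "input" with a dynamic batch dimension (the tensor name and shape ranges are illustrative):

# Python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(0)
# ... populate or parse the network here ...
config = builder.create_builder_config()

# Declare the legal shape range (min / opt / max) for the dynamic input.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (32, 3, 224, 224))
config.add_optimization_profile(profile)

# At inference time, pin the actual shape on the context before running:
# context = engine.create_execution_context()
# context.set_input_shape("input", (8, 3, 224, 224))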
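And for bindings/stream: a sketch of roughly what common.do_inference does, using pycuda and the newer execute_async_v3 API, where per-tensor device addresses are registered via set_tensor_address instead of the legacy bindings list. Buffer and tensor names here are illustrative assumptions:

# Python
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

def do_inference_sketch(context, h_input, d_input, h_output, d_output,
                        in_name="input", out_name="output"):
    stream = cuda.Stream()
    # Tell TensorRT where each I/O tensor lives in device memory.
    context.set_tensor_address(in_name, int(d_input))
    context.set_tensor_address(out_name, int(d_output))
    # Enqueue everything asynchronously on the stream: H2D copy,
    # kernel execution, D2H copy; then wait for the stream to drain.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v3(stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()
    return h_output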
