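TensorRT Study Notes (3): TensorRT Developer Guide

II. TensorRT

TensorRT Developer Guide: shows how to use the C++ and Python APIs to implement the most common deep learning layers. It shows how to take an existing model built with a deep learning framework and use it to build a TensorRT engine with the provided parsers. The Developer Guide also gives step-by-step instructions for common user tasks, such as creating a TensorRT network definition, invoking the TensorRT builder, serializing and deserializing, and feeding the engine with data and performing inference, all with either the C++ or the Python API.

Accompanying project: www.github.com

2. Creating a network definition from scratch using the C++ API

2.1. Initialization: instantiating TensorRT objects in C++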

A method called createNetwork defined for IBuilder is used to create an object of type INetworkDefinition.

A method called parse() from the object of type IParser is called to read the model file and populate the TensorRT network.

A method called buildCudaEngine() of IBuilder is called to create an object of ICudaEngine type.

A global TensorRT API method called createInferRuntime(gLogger) is used to create an object of type IRuntime.

One of the available parsers (Caffe, ONNX, or UFF) is created, using the INetworkDefinition as the input.

An object of type ILogger needs to be created globally.

// 2.1. Instantiating TensorRT Objects in C++
#include <iostream>
#include "NvInfer.h"

using namespace nvinfer1;

class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        // suppress info-level messages
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

// Create one of the following parsers (they are alternatives, not meant to be used together).
// ONNX:
auto parser = nvonnxparser::createParser(*network, gLogger);
// Caffe:
auto parser = nvcaffeparser1::createCaffeParser();
// UFF:
auto parser = nvuffparser::createUffParser();


2.2. Creating a simple network with Input, Convolution, Pooling, FullyConnected, Activation, and SoftMax layers:

// Create the builder and the network:
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
// Add the input layer, with the input dimensions, to the network. A network can have multiple inputs, although in this sample there is only one:
auto data = network->addInput(INPUT_BLOB_NAME, dt, Dims3{1, INPUT_H, INPUT_W});
// Add the Convolution layer with hidden layer input nodes, strides, and weights for filter and bias.
// addInput() returns an ITensor*, so it is passed directly; for layers, retrieve the output tensor with getOutput(0):
auto conv1 = network->addConvolution(*data, 20, DimsHW{5, 5}, weightMap["conv1filter"], weightMap["conv1bias"]);
conv1->setStride(DimsHW{1, 1});
// Note: weights passed to TensorRT layers are in host memory.
// Add the Pooling layer:
auto pool1 = network->addPooling(*conv1->getOutput(0), PoolingType::kMAX, DimsHW{2, 2});
pool1->setStride(DimsHW{2, 2});
// Add the FullyConnected and Activation layers:
auto ip1 = network->addFullyConnected(*pool1->getOutput(0), 500, weightMap["ip1filter"], weightMap["ip1bias"]);
auto relu1 = network->addActivation(*ip1->getOutput(0), ActivationType::kRELU);
// Add the SoftMax layer to calculate the final probabilities and set it as the output:
auto prob = network->addSoftMax(*relu1->getOutput(0));
prob->getOutput(0)->setName(OUTPUT_BLOB_NAME);
// Mark the output:
network->markOutput(*prob->getOutput(0));

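To make the layer parameters in 2.2 concrete, the sketch below traces the tensor shapes through the network in plain Python. It assumes a 28x28 input (INPUT_H and INPUT_W are not given in these notes; 28 is the usual MNIST size) and no padding. The helper name window_out is made up for this illustration.

```python
def window_out(size, window, stride, padding=0):
    """Output extent of a convolution or pooling window along one axis."""
    return (size + 2 * padding - window) // stride + 1


# Assumed input size; INPUT_H and INPUT_W are not specified in the notes.
INPUT_H, INPUT_W = 28, 28

# CHW layout, matching Dims3{1, INPUT_H, INPUT_W}
data_shape = (1, INPUT_H, INPUT_W)

# conv1: 20 output maps, 5x5 kernel, stride 1
conv1_shape = (20, window_out(data_shape[1], 5, 1), window_out(data_shape[2], 5, 1))

# pool1: 2x2 max pooling with stride 2; the channel count is unchanged
pool1_shape = (conv1_shape[0], window_out(conv1_shape[1], 2, 2), window_out(conv1_shape[2], 2, 2))

# ip1: the fully connected layer consumes every value of pool1 and emits 500 outputs
ip1_inputs = pool1_shape[0] * pool1_shape[1] * pool1_shape[2]

print(conv1_shape)  # (20, 24, 24)
print(pool1_shape)  # (20, 12, 12)
print(ip1_inputs)   # 2880
```

The ReLU and SoftMax layers that follow keep the shape of their input, so the output has 500 values per image.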

　　Importing a model using a parser in C++:

　　　　1. Importing a Caffe model using the C++ parser API:

// Create the builder and network:
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
// Create the Caffe parser:
ICaffeParser* parser = createCaffeParser();
// Parse the imported model:
const IBlobNameToTensor* blobNameToTensor = parser->parse("deploy_file", "modelFile", *network, DataType::kFLOAT);
// Specify the outputs of the network:
for (auto& s : outputs)
    network->markOutput(*blobNameToTensor->find(s.c_str()));


　　　　2. Importing a TensorFlow model:

// Create the builder and network:
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetwork();
// Create the UFF parser:
IUffParser* parser = createUffParser();
// Declare the network inputs and outputs to the UFF parser:
parser->registerInput("Input_0", DimsCHW(1, 28, 28), UffInputOrder::kNCHW);
parser->registerOutput("Binary_3");
// Parse the imported model to populate the network:
parser->parse(uffFile, *network, nvinfer1::DataType::kFLOAT);


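　　　　3. Importing an ONNX model: model files from earlier ONNX versions should be converted to a later supported version. The steps differ slightly from the two parsers above.

// Create the builder and network.
IBuilder* builder = createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
// Create the ONNX parser.
nvonnxparser::IParser* parser =
    nvonnxparser::createParser(*network, gLogger);
// Ingest the model (the verbosity argument is an int, so the severity is cast):
parser->parseFromFile(onnx_filename,
    static_cast<int>(ILogger::Severity::kWARNING));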

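The explicitBatch value in the ONNX snippet is a bitmask: each creation flag names a bit position, and the mask sets that bit. The sketch below mirrors that arithmetic in plain Python. The CreationFlag class is a hypothetical stand-in rather than the TensorRT type, and the bit position 0 for EXPLICIT_BATCH is an assumption about the TensorRT headers.

```python
from enum import IntEnum


class CreationFlag(IntEnum):
    """Stand-in for NetworkDefinitionCreationFlag (hypothetical, for illustration only)."""
    EXPLICIT_BATCH = 0  # assumed bit position


def flag_mask(*flags):
    """Combine flag bit positions into a single integer mask."""
    mask = 0
    for flag in flags:
        mask |= 1 << int(flag)
    return mask


explicit_batch = flag_mask(CreationFlag.EXPLICIT_BATCH)
print(explicit_batch)  # 1
```

Under that assumption the mask passed to createNetworkV2 is simply 1; building it from the enum keeps the code readable if more flags are combined later.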

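2.3. Building an engine in C++

The next step is to invoke the TensorRT builder to create an optimized runtime. One of the functions of the builder is to search through its catalog of CUDA kernels for the fastest implementation available, so the engine must be built on the same GPU as the one on which the optimized engine will run.

// Build the engine using the builder object:
builder->setMaxBatchSize(maxBatchSize);
IBuilderConfig* config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// When the engine is built, TensorRT makes copies of the weights.

// Dispose of the network, builder, and parser once you are done with them:
parser->destroy();
network->destroy();
config->destroy();
builder->destroy();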

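The value passed to setMaxWorkspaceSize is a byte count, so 1 << 20 means 1 MiB. A small sketch of the arithmetic; the helper names mib and gib are made up for this illustration.

```python
def mib(n):
    """Number of bytes in n mebibytes."""
    return n << 20


def gib(n):
    """Number of bytes in n gibibytes."""
    return n << 30


print(mib(1))             # 1048576
print(mib(1) == 1 << 20)  # True
print(gib(1))             # 1073741824
```

Writing the size as a shift (or through a helper like this) makes it easy to see at a glance how much scratch memory the builder is allowed to use.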

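2.4. Serializing a model in C++

It is not strictly necessary to serialize and deserialize a model before using it for inference; if desired, the engine object can be used for inference directly. Note: serialized engines are not portable across platforms or TensorRT versions.

// Run the builder as a prior offline step and then serialize:
IHostMemory *serializedModel = engine->serialize();
// store model to disk
// <…>
serializedModel->destroy();
// Create a runtime object to deserialize:
IRuntime* runtime = createInferRuntime(gLogger);
ICudaEngine* engine = runtime->deserializeCudaEngine(modelData, modelSize, nullptr);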

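Because a serialized engine is tied to the platform and TensorRT version that produced it, one possible safeguard is to store a small description of the build environment next to the engine file and check it before loading. The sketch below is only an illustration of that idea in plain Python: the sidecar file, the helper names, the version string, and the use of platform.machine() as a stand-in for "platform" are all assumptions, and placeholder bytes take the place of a real engine.

```python
import json
import os
import platform
import tempfile


def save_engine(path, engine_bytes, trt_version):
    """Write the serialized engine plus a sidecar file describing where it was built."""
    with open(path, "wb") as f:
        f.write(engine_bytes)
    meta = {"trt_version": trt_version, "machine": platform.machine()}
    with open(path + ".json", "w") as f:
        json.dump(meta, f)


def load_engine(path, trt_version):
    """Return the engine bytes, refusing files built in a different environment."""
    with open(path + ".json") as f:
        meta = json.load(f)
    expected = {"trt_version": trt_version, "machine": platform.machine()}
    if meta != expected:
        raise RuntimeError(f"engine was built for {meta}, but this environment is {expected}")
    with open(path, "rb") as f:
        return f.read()


# Placeholder bytes stand in for the contents of the serialized model buffer.
engine_path = os.path.join(tempfile.mkdtemp(), "sample.engine")
save_engine(engine_path, b"\x00serialized-engine\x00", trt_version="7.0.0")
restored = load_engine(engine_path, trt_version="7.0.0")
print(restored == b"\x00serialized-engine\x00")  # True
```

Loading the same file with a different version string raises RuntimeError, which is usually a friendlier failure than a deserialization error deep inside the runtime.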

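2.5. Performing inference in C++

// Create some space to store intermediate activation values. Since the engine holds the network definition and trained parameters, additional space is necessary. These are held in an execution context:
IExecutionContext *context = engine->createExecutionContext();
// An engine can have multiple execution contexts, allowing one set of weights to be used for multiple overlapping inference tasks. For example, you can process images in parallel CUDA streams using one engine and one context per stream. Each context is created on the same GPU as the engine.

// Use the input and output blob names to get the corresponding input and output indices:
int inputIndex = engine->getBindingIndex(INPUT_BLOB_NAME);
int outputIndex = engine->getBindingIndex(OUTPUT_BLOB_NAME);
// Using these indices, set up a buffer array pointing to the input and output buffers on the GPU:
void* buffers[2];
buffers[inputIndex] = inputBuffer;
buffers[outputIndex] = outputBuffer;
// TensorRT execution is typically asynchronous, so enqueue the kernels on a CUDA stream:
context->enqueue(batchSize, buffers, stream, nullptr);
// It is common to enqueue asynchronous memcpy() calls before and after the kernels to move data to and from the GPU if it is not already there. The final argument to enqueue() is an optional CUDA event that is signaled when the input buffers have been consumed and their memory can be safely reused.

// To determine when the kernels (and possibly the memcpy() calls) are complete, use standard CUDA synchronization mechanisms such as events, or wait on the stream.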

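The buffers array in 2.5 is indexed by binding index rather than by a fixed input-then-output order, which is why the indices are looked up by name first. A plain-Python sketch of that bookkeeping follows; the blob names, the name-to-index mapping, and the pointer values are made-up stand-ins for what getBindingIndex and the CUDA allocator would return.

```python
def make_bindings(index_of, pointers):
    """Place each device pointer at the slot given by its binding index."""
    bindings = [None] * len(index_of)
    for name, index in index_of.items():
        bindings[index] = pointers[name]
    return bindings


# Hypothetical values: binding indices as the engine might report them,
# and device addresses as an allocator might return them.
index_of = {"data": 0, "prob": 1}
pointers = {"prob": 0x7F0000002000, "data": 0x7F0000001000}

bindings = make_bindings(index_of, pointers)
print([hex(p) for p in bindings])  # ['0x7f0000001000', '0x7f0000002000']
```

Filling the array through the looked-up indices keeps the code correct even if the engine orders its bindings differently from what the application expects.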

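2.6. Memory management in C++ (two mechanisms)

　　By default, when an IExecutionContext is created, persistent device memory is allocated to hold activation data. To avoid this allocation, call createExecutionContextWithoutDeviceMemory. It is then the application's responsibility to call IExecutionContext::setDeviceMemory() to provide the memory required to run the network. The size of the memory block is returned by ICudaEngine::getDeviceMemorySize().

　　In addition, the application can supply a custom allocator for use during build and runtime by implementing the IGpuAllocator interface. Once the interface is implemented, call setGpuAllocator(&allocator); on the IBuilder or IRuntime interface. All device memory is then allocated and freed through this interface.

2.7. Refitting an engine

　　TensorRT can refit an engine with new weights without having to rebuild it. The engine must be built as "refittable". Because of the way the engine is optimized, if you change some weights, you may have to supply some other weights too. The interface can tell you which additional weights need to be supplied.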

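// Request a refittable engine before building it:
...
builder->setRefittable(true);
builder->buildCudaEngine(*network);
// Create a refitter object:
ICudaEngine* engine = ...;
IRefitter* refitter = createInferRefitter(*engine, gLogger);
// Update the weights that you want to update. For example, to update the kernel weights of a convolution layer named "MyLayer":
Weights newWeights = ...;
refitter->setWeights("MyLayer", WeightsRole::kKERNEL,
                     newWeights);
// The new weights should have the same count as the original weights used to build the engine.

// setWeights returns false if something went wrong, such as a wrong layer name or role, or a change in the weights count.

// Find out what other weights must be supplied. This typically requires two calls to IRefitter::getMissing:
// first to get the number of Weights objects that must be supplied, and second to get their layers and roles.
const int n = refitter->getMissing(0, nullptr, nullptr);
std::vector<const char*> layerNames(n);
std::vector<WeightsRole> weightsRoles(n);
refitter->getMissing(n, layerNames.data(),
                        weightsRoles.data());
// Supply the missing weights, in any order:
for (int i = 0; i < n; ++i)
    refitter->setWeights(layerNames[i], weightsRoles[i],
                         Weights{...});
// Supplying only the missing weights will not generate a need for more weights. Supplying any additional weights may trigger the need for yet more weights.

// Update the engine with all the weights that are provided:
bool success = refitter->refitCudaEngine();
assert(success);
// If success is false, check the log for a diagnostic, perhaps about weights that are still missing.

// Destroy the refitter:
refitter->destroy();
// The updated engine behaves as if it had been built from a network updated with the new weights.

// To view all refittable weights in an engine, use refitter->getAll(...), in the same way getMissing is used above.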

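The refit workflow in 2.7 (set some weights, ask what is still missing, supply it, then refit) can be hard to picture without an engine at hand. The toy model below imitates that flow in plain Python. ToyRefitter is a hypothetical stand-in, not the TensorRT implementation, and the rule it uses (weights belong to disjoint groups that must be updated together) is an assumption chosen only to make the "missing weights" behavior visible.

```python
class ToyRefitter:
    """Hypothetical stand-in for IRefitter, for illustration only."""

    def __init__(self, groups):
        # groups: disjoint sets of (layer, role) pairs that must be updated together
        self.groups = groups
        self.supplied = {}

    def set_weights(self, layer, role, values):
        """Record new weights for one (layer, role) pair."""
        self.supplied[(layer, role)] = values
        return True

    def get_missing(self):
        """List the pairs that still have to be supplied before a refit can succeed."""
        missing = set()
        for group in self.groups:
            if any(key in self.supplied for key in group):
                missing |= {key for key in group if key not in self.supplied}
        return sorted(missing)

    def refit(self):
        """Succeed only when nothing is missing."""
        return not self.get_missing()


refitter = ToyRefitter([
    {("MyLayer", "KERNEL"), ("MyLayer", "BIAS")},
    {("OtherLayer", "KERNEL")},
])
refitter.set_weights("MyLayer", "KERNEL", [0.1, 0.2])
print(refitter.get_missing())  # [('MyLayer', 'BIAS')]
print(refitter.refit())        # False

# Supply exactly what was reported missing, in any order.
for layer, role in refitter.get_missing():
    refitter.set_weights(layer, role, [0.0])
print(refitter.refit())        # True
```

As in the real workflow, supplying only the reported pairs does not create new requirements, while touching a weight from another group would.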

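3. Using the Python API

　　3.1. Importing TensorRT into Python

　　3.2. Creating a network definition in Python

　　3.3. Building an engine in Python

　　3.4. Serializing a model in Python

　　3.5. Performing inference in Python

# Import TensorRT into Python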

import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


# Create the network definition
# Create the builder and network
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
    # Configure the network layers based on the weights provided. In this case, the weights are imported from a pytorch model. 
    # Add an input layer. The name is a string, dtype is a TensorRT dtype, and the shape can be provided as either a list or tuple.
    input_tensor = network.add_input(name=INPUT_NAME, dtype=trt.float32, shape=INPUT_SHAPE)

    # Add a convolution layer
    conv1_w = weights['conv1.weight'].numpy()
    conv1_b = weights['conv1.bias'].numpy()
    conv1 = network.add_convolution(input=input_tensor, num_output_maps=20, kernel_shape=(5, 5), kernel=conv1_w, bias=conv1_b)
    conv1.stride = (1, 1)

    pool1 = network.add_pooling(input=conv1.get_output(0), type=trt.PoolingType.MAX, window_size=(2, 2))
    pool1.stride = (2, 2)
    conv2_w = weights['conv2.weight'].numpy()
    conv2_b = weights['conv2.bias'].numpy()
    conv2 = network.add_convolution(pool1.get_output(0), 50, (5, 5), conv2_w, conv2_b)
    conv2.stride = (1, 1)

    pool2 = network.add_pooling(conv2.get_output(0), trt.PoolingType.MAX, (2, 2))
    pool2.stride = (2, 2)

    fc1_w = weights['fc1.weight'].numpy()
    fc1_b = weights['fc1.bias'].numpy()
    fc1 = network.add_fully_connected(input=pool2.get_output(0), num_outputs=500, kernel=fc1_w, bias=fc1_b)

    relu1 = network.add_activation(fc1.get_output(0), trt.ActivationType.RELU)

    fc2_w = weights['fc2.weight'].numpy()
    fc2_b = weights['fc2.bias'].numpy()
    fc2 = network.add_fully_connected(relu1.get_output(0), OUTPUT_SIZE, fc2_w, fc2_b)

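    fc2.get_output(0).name = OUTPUT_NAME
    # Mark the tensor as a network output, as in the C++ version.
    network.mark_output(fc2.get_output(0))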


# Import a model using a parser

# Caffe parser
import tensorrt as trt

datatype = trt.float32

deploy_file = 'data/mnist/mnist.prototxt'
model_file = 'data/mnist/mnist.caffemodel'
with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.CaffeParser() as parser:
    model_tensors = parser.parse(deploy=deploy_file, 
                                 model=model_file, 
                                 network=network, 
                                 dtype=datatype)
 
# TensorFlow (UFF)

import tensorrt as trt

# The frozen graph is converted to UFF first: convert-to-uff frozen_inference_graph.pb
model_file = '/data/mnist/mnist.uff'

with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network, trt.UffParser() as parser:
    parser.register_input("Placeholder", (1, 28, 28))
    parser.register_output("fc2/Relu")
    parser.parse(model_file, network)

# Import from ONNX
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:
    with open(model_path, 'rb') as model:
        parser.parse(model.read())

# Import from PyTorch and other frameworks: replicate the network architecture with the TensorRT API (see creating a network definition from scratch using the Python API), then copy the weights from PyTorch.

# Build the engine
# `builder` and `network` are the objects created above; in a complete program
# the code below runs inside the same `with` block that created them.
builder.max_batch_size = max_batch_size
with builder.create_builder_config() as config:
    # This determines the amount of memory available to the builder when building an optimized engine and should generally be set as high as possible.
    config.max_workspace_size = 1 << 20
    with builder.build_engine(network, config) as engine:
        # Do inference here.

        # Serialize the model
        serialized_engine = engine.serialize()
        # Write it to a file
        with open("sample.engine", "wb") as f:
            f.write(serialized_engine)

# Deserialize from a file:
# with open("sample.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
#     engine = runtime.deserialize_cuda_engine(f.read())

# Or deserialize from memory:
# with trt.Runtime(TRT_LOGGER) as runtime:
#     engine = runtime.deserialize_cuda_engine(serialized_engine)

# Perform inference
import numpy as np
import pycuda.autoinit  # creates and activates a CUDA context on import
import pycuda.driver as cuda

def do_inference(engine):
    # Allocate some host and device buffers for inputs and outputs.
    # Determine dimensions and create page-locked memory buffers (i.e. won't be swapped to disk) to hold host inputs/outputs.
    h_input = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
    # Allocate device memory for inputs and outputs.
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()

    # Create some space to store intermediate activation values. Since the engine holds the network definition and trained parameters, additional space is necessary. These are held in an execution context:
    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
    # Return the host output.
    return h_output

if __name__ == "__main__":
    engine = ...  # build or deserialize an engine as shown above
    do_inference(engine)

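The host buffers in do_inference are sized with trt.volume, which multiplies the dimensions of a binding shape together; the byte size then follows from the element type. The sketch below reproduces that sizing arithmetic without TensorRT or a GPU. The shapes (1, 28, 28) and (10,) are assumed MNIST-style values, not taken from these notes.

```python
import math


def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return math.prod(shape)


FLOAT32_BYTES = 4  # size of one np.float32 element

# Assumed binding shapes for an MNIST-style network.
input_shape = (1, 28, 28)
output_shape = (10,)

input_bytes = volume(input_shape) * FLOAT32_BYTES
output_bytes = volume(output_shape) * FLOAT32_BYTES

print(volume(input_shape))  # 784
print(input_bytes)          # 3136
print(output_bytes)         # 40
```

These are the sizes that cuda.mem_alloc would receive through h_input.nbytes and h_output.nbytes under the same assumptions.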

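4. Extending TensorRT with custom layers

　　4.1. Adding custom layers using the C++ API

　　4.2. Adding custom layers using the Python API

　　　　4.2.1. Using a custom layer that ships with TensorRT, after registering it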

import tensorrt as trt
import numpy as np

TRT_LOGGER = trt.Logger()

trt.init_libnvinfer_plugins(TRT_LOGGER, '')
PLUGIN_CREATORS = trt.get_plugin_registry().plugin_creator_list

def get_trt_plugin(plugin_name):
        plugin = None
        for plugin_creator in PLUGIN_CREATORS:
            if plugin_creator.name == plugin_name:
                lrelu_slope_field = trt.PluginField("neg_slope", np.array([0.1], dtype=np.float32), trt.PluginFieldType.FLOAT32)
                field_collection = trt.PluginFieldCollection([lrelu_slope_field])
                plugin = plugin_creator.create_plugin(name=plugin_name, field_collection=field_collection)
        return plugin

def main():
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
        builder.max_workspace_size = 2**20
        input_layer = network.add_input(name="input_layer", dtype=trt.float32, shape=(1, 1))
        lrelu = network.add_plugin_v2(inputs=[input_layer], plugin=get_trt_plugin("LReLU_TRT"))
        lrelu.get_output(0).name = "outputs"
        network.mark_output(lrelu.get_output(0))


　　　　4.2.2. Adding a custom layer that is not supported in UFF (an op that already exists in TensorFlow, not a layer you wrote yourself)

　　　　　　TensorFlow networks can be converted to UFF format and run with TensorRT through the Python interface. To do this, we use the GraphSurgeon API. If you are writing your own plugin, you need to implement it in C++ by implementing the IPluginExt and IPluginCreator classes, as shown in Example 1: Adding a Custom Layer Using C++ for Caffe.

　　　　With the UFF parser, the custom layer is run through a plugin node registered in the TensorRT plugin registry.

import graphsurgeon as gs
import tensorflow as tf
import tensorrt as trt
import uff

# Register the plugins with TensorRT (or load the .so file):
trt.init_libnvinfer_plugins(TRT_LOGGER, '')

# Prepare the network and check the TensorFlow output:
tf_sess = tf.InteractiveSession()
tf_input = tf.placeholder(tf.float32, name="placeholder")
tf_lrelu = tf.nn.leaky_relu(tf_input, alpha=lrelu_alpha, name="tf_lrelu")
tf_result = tf_sess.run(tf_lrelu, feed_dict={tf_input: lrelu_args})
tf_sess.close()
# Prepare the namespace mappings. The op name LReLU_TRT corresponds to the Leaky ReLU plugin shipped with TensorRT.
trt_lrelu = gs.create_plugin_node(name="trt_lrelu", op="LReLU_TRT", negSlope=lrelu_alpha)
namespace_plugin_map = {
    "tf_lrelu": trt_lrelu
}
# Transform the TensorFlow graph using GraphSurgeon and save to UFF:
dynamic_graph = gs.DynamicGraph(tf_lrelu.graph)
dynamic_graph.collapse_namespaces(namespace_plugin_map)
# Run the UFF parser and compare results with TensorFlow:
uff_model = uff.from_tensorflow(dynamic_graph.as_graph_def(), ["trt_lrelu"], output_filename=model_path, text=True)
parser = trt.UffParser()
parser.register_input("placeholder", [lrelu_args.size])
parser.register_output("trt_lrelu")
parser.parse(model_path, trt_network)


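　　4.3. Using custom layers when importing a model from a framework

　　4.4. Plugin API description

　　4.5. Best practices for custom layer plugins

5. Working with mixed precision

　　5.1. Mixed precision using the C++ API

　　　　5.1.1. Setting the layer precision using C++

　　　　5.1.2. Enabling FP16 inference using C++

　　　　5.1.3. Enabling INT8 inference using C++

　　　　5.1.4. Working with explicit precision using C++

　　5.2. Mixed precision using the Python API

　　　　5.2.1. Setting the layer precision using Python

　　　　5.2.2. Enabling FP16 inference using Python

　　　　5.2.3. Enabling INT8 inference using Python

　　　　5.2.4. Working with explicit precision using Python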

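6. Working with reformat-free network I/O tensors

　　6.1. Building an engine with reformat-free network I/O tensors

　　6.2. Supported combinations of data type and memory layout

　　6.3. Calibrating a network with INT8 I/O tensors

7. Working with dynamic shapes

　　7.1. Specifying runtime dimensions

　　7.2. Optimization profiles

　　7.3. Layer extensions for dynamic shapes

　　7.4. Restrictions for dynamic shapes

　　7.5. Execution tensors vs. shape tensors

　　7.5.1. Formal inference rules

　　7.6. Shape tensor I/O (advanced)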

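8. Working with loops

　　8.1. Defining a loop

　　8.2. Formal semantics

　　8.3. Nested loops

　　8.4. Limitations

9. Working with quantized networks

　　9.1. Quantization-aware training (QAT) using TensorFlow

　　9.2. Converting TensorFlow to a quantized ONNX model

　　9.3. Importing quantized ONNX models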

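10. Working with DLA

　　10.1. Running on DLA during TensorRT inference

　　　　10.1.1. Example 1: sampleMNIST with DLA

　　　　10.1.2. Example 2: Enabling DLA mode for a layer during network creation

　　10.2. DLA supported layers

　　10.3. GPU fallback mode

11. Deploying a TensorRT optimized model

　　11.1. Deploying in the cloud

　　11.2. Deploying to an embedded system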

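12. Working with deep learning frameworks

　　12.1. Working with TensorFlow

　　　　12.1.1. Freezing a TensorFlow graph

　　　　12.1.2. Freezing a Keras model

　　　　12.1.3. Converting a frozen graph to UFF

　　　　12.1.4. Working with TensorFlow RNN weights

　　　　　　12.1.4.1. TensorFlow RNN cells supported in TensorRT

　　　　　　12.1.4.2. Maintaining model consistency between TensorFlow and TensorRT

　　　　　　12.1.4.3. Workflow

　　　　　　12.1.4.4. Dumping the TensorFlow weights

　　　　　　12.1.4.5. Loading the dumped weights

　　　　　　12.1.4.6. Converting the weights to a TensorRT format

　　　　　　　　12.1.4.6.1. TensorFlow checkpoint storage format

　　　　　　　　12.1.4.6.2. TensorFlow kernel tensor storage format

　　　　　　　　12.1.4.6.3. Converting the kernel weights to a TensorRT format

　　　　　　　　12.1.4.6.4. TensorFlow bias weights storage format

　　　　　　　　12.1.4.6.5. Converting the bias tensor to a TensorRT format

　　　　　　12.1.4.7. BasicLSTMCell example

　　　　　　　　12.1.4.7.1. BasicLSTMCell kernel tensor

　　　　　　　　12.1.4.7.2. BasicLSTMCell bias tensor

　　　　　　　　12.1.4.8. Setting the converted weights and biases

　　　　12.1.5. Preprocessing a TensorFlow graph using the Graph Surgeon API

　　12.2. Working with PyTorch and other frameworks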

13. Working with DALI

　　13.1. Benefits of integration

14. Troubleshooting

　　14.1. FAQs
　　14.2. How do I report a bug?
　　14.3. Understanding error messages
　　14.4. Support