Auto-Tuning Principles
9.8.1 Auto-Tuning a Convolutional Network for NVIDIA GPUs
Auto-tuning for a specific device and workload is critical for getting the best performance. This section describes how to tune a whole convolutional network for NVIDIA GPUs.
The operator implementations for NVIDIA GPUs in TVM are written in template form. A template has many tunable knobs (tiling factors, unrolling, and so on). We will tune all of the convolution and depthwise-convolution operators in the neural network. After tuning, a log file is produced that stores the best knob values for all required operators. When the TVM compiler compiles these operators, it queries this log file to obtain the best knob values.
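The log-file idea can be pictured with a plain-Python sketch. This is only an illustration of the concept; the names `best_configs` and `update_log` are hypothetical, and TVM's real log format and query path are different:

```python
# Illustrative sketch only: a tuning log conceptually maps each workload
# to the best knob values measured so far.
best_configs = {}

def update_log(workload, knobs, gflops):
    # Keep only the fastest configuration seen for each workload.
    if workload not in best_configs or gflops > best_configs[workload][1]:
        best_configs[workload] = (knobs, gflops)

update_log("conv2d_1", {"tile_x": 8, "unroll": 0}, 1200.0)
update_log("conv2d_1", {"tile_x": 16, "unroll": 1}, 1500.0)

# At compile time, the best knobs are simply looked up by workload.
print(best_configs["conv2d_1"][0])  # {'tile_x': 16, 'unroll': 1}
```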
Some pre-tuned parameters for NVIDIA GPUs have also been released; see the NVIDIA GPU benchmark results.
Note that this tutorial will not run on Windows or recent versions of macOS. To get it to run, you need to wrap the body in an if __name__ == "__main__": block.
Install dependencies
To use the autotvm package in TVM, we need to install some extra dependencies. (Change "3" to "2" if you use python2.)
pip3 install --user psutil xgboost tornado cloudpickle
To make TVM run faster during tuning, it is recommended to use cython as the FFI of TVM. In the root directory of TVM, execute:
pip3 install --user cython
sudo make cython3
Now return to the python code. Import the packages.
import os
import numpy as np
import tvm
from tvm import relay, autotvm
import tvm.relay.testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
import tvm.contrib.graph_executor as runtime
Define the Network
First, we need to define the network in the Relay frontend API. We can load some pre-defined networks from tvm.relay.testing. We can also load models from MXNet, ONNX, and TensorFlow.
def get_network(name, batch_size):
    """Get the symbol definition and random weights of a network."""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)
    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    else:
        raise ValueError("Unsupported network: " + name)
    return mod, params, input_shape, output_shape
Set Tuning Options
Before tuning, we apply some configurations.
#### DEVICE CONFIG ####
target = tvm.target.cuda()

#### TUNING OPTION ####
network = "resnet-18"
log_file = "%s.log" % network
dtype = "float32"

tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 2000,
    "early_stopping": 600,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4, min_repeat_ms=150),
    ),
}
/workspace/python/tvm/target/target.py:446: UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target.
  warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.")
Note

How to set tuning options

In general, the default values provided here work well.
If you have a large time budget, you can set n_trial and early_stopping larger, which makes the tuning run for longer.
If you have multiple devices, you can use all of them for measurement to speed up the tuning process. (See the "Scale up measurement" section below.)
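The interaction between n_trial and early_stopping can be sketched in plain Python. This is an illustration of the idea only; tune_with_early_stopping is a hypothetical helper, and TVM's internal logic differs in detail:

```python
# Sketch: stop a tuning task when no better result has been seen for
# `early_stopping` consecutive trials, otherwise run all trials.
def tune_with_early_stopping(results, early_stopping):
    best, since_best = float("-inf"), 0
    for i, gflops in enumerate(results):
        if gflops > best:
            best, since_best = gflops, 0
        else:
            since_best += 1
        if since_best >= early_stopping:
            return i + 1  # number of trials actually run
    return len(results)

print(tune_with_early_stopping([1.0, 2.0, 1.5, 1.8, 1.9], 3))  # 5
```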
Begin Tuning
Now we can extract tuning tasks from the network and begin tuning. Here we provide a simple utility function to tune a list of tasks. This function is just an initial implementation that tunes them in sequential order. A more sophisticated tuning scheduler may be introduced in the future.
# You can skip the implementation of this function for this tutorial.
def tune_tasks(
    tasks,
    measure_option,
    tuner="xgb",
    n_trial=1000,
    early_stopping=None,
    log_filename="tuning.log",
    use_transfer_learning=True,
):
    # create a tmp log file
    tmp_log_file = log_filename + ".tmp"
    if os.path.exists(tmp_log_file):
        os.remove(tmp_log_file)

    for i, tsk in enumerate(reversed(tasks)):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb":
            tuner_obj = XGBTuner(tsk, loss_type="reg")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="knob")
        elif tuner == "xgb_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="itervar")
        elif tuner == "xgb_curve":
            tuner_obj = XGBTuner(tsk, loss_type="reg", feature_type="curve")
        elif tuner == "xgb_rank":
            tuner_obj = XGBTuner(tsk, loss_type="rank")
        elif tuner == "xgb_rank_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="knob")
        elif tuner == "xgb_rank_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="itervar")
        elif tuner == "xgb_rank_curve":
            tuner_obj = XGBTuner(tsk, loss_type="rank", feature_type="curve")
        elif tuner == "xgb_rank_binary":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary")
        elif tuner == "xgb_rank_binary_knob":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="knob")
        elif tuner == "xgb_rank_binary_itervar":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="itervar")
        elif tuner == "xgb_rank_binary_curve":
            tuner_obj = XGBTuner(tsk, loss_type="rank-binary", feature_type="curve")
        elif tuner == "ga":
            tuner_obj = GATuner(tsk, pop_size=100)
        elif tuner == "random":
            tuner_obj = RandomTuner(tsk)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(tsk)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        if use_transfer_learning:
            if os.path.isfile(tmp_log_file):
                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))

        # do tuning
        tsk_trial = min(n_trial, len(tsk.config_space))
        tuner_obj.tune(
            n_trial=tsk_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),
                autotvm.callback.log_to_file(tmp_log_file),
            ],
        )

    # pick the best records from the cache file
    autotvm.record.pick_best(tmp_log_file, log_filename)
    os.remove(tmp_log_file)
Finally, we launch the tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
    # extract workloads from the relay program
    print("Extract tasks...")
    mod, params, input_shape, out_shape = get_network(network, batch_size=1)
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )

    # run tuning tasks
    print("Tuning...")
    tune_tasks(tasks, **tuning_opt)

    # compile kernels with the history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)

        # load parameters
        dev = tvm.device(str(target), 0)
        module = runtime.GraphModule(lib["default"](dev))
        data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
        module.set_input("data", data_tvm)

        # evaluate
        print("Evaluate inference time cost...")
        print(module.benchmark(dev, number=1, repeat=600))
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
Sample Output
The tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. One sample output is listed below. It took about 4 hours to obtain this output on a 32T AMD Ryzen Threadripper; the tuning target is an NVIDIA 1080 Ti. (You may see some errors during compilation. As long as the tuning does not get stuck, it is okay.)
Extract tasks...
Tuning...
[Task 1/12] Current/Best: 541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.
[Task 2/12] Current/Best: 0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.
[Task 3/12] Current/Best: 103.69/1141.25 GFLOPS | Progress: (768/2000) | 702.13 s Done.
[Task 4/12] Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 sterminate called without an active exception
[Task 4/12] Current/Best: 2789.36/3925.15 GFLOPS | Progress: (1056/2000) | 929.40 s Done.
[Task 5/12] Current/Best: 89.06/1076.24 GFLOPS | Progress: (704/2000) | 601.73 s Done.
[Task 6/12] Current/Best: 40.39/2129.02 GFLOPS | Progress: (1088/2000) | 1125.76 s Done.
[Task 7/12] Current/Best: 4090.53/5007.02 GFLOPS | Progress: (800/2000) | 903.90 s Done.
[Task 8/12] Current/Best: 4.78/1272.28 GFLOPS | Progress: (768/2000) | 749.14 s Done.
[Task 9/12] Current/Best: 1391.45/2325.08 GFLOPS | Progress: (992/2000) | 1084.87 s Done.
[Task 10/12] Current/Best: 1995.44/2383.59 GFLOPS | Progress: (864/2000) | 862.60 s Done.
[Task 11/12] Current/Best: 4093.94/4899.80 GFLOPS | Progress: (224/2000) | 240.92 sterminate called without an active exception
[Task 11/12] Current/Best: 3487.98/4909.91 GFLOPS | Progress: (480/2000) | 534.96 sterminate called without an active exception
[Task 11/12] Current/Best: 4636.84/4912.17 GFLOPS | Progress: (1184/2000) | 1381.16 sterminate called without an active exception
[Task 11/12] Current/Best: 50.12/4912.17 GFLOPS | Progress: (1344/2000) | 1602.81 s Done.
[Task 12/12] Current/Best: 3581.31/4286.30 GFLOPS | Progress: (736/2000) | 943.52 s Done.
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 1.07 ms (0.05 ms)
As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30 ms, so our result is a little faster.
Note

Experiencing difficulties?

The auto-tuning module is error-prone. If you always see "0.00/ 0.00 GFLOPS", then there must be something wrong.
First, make sure you set the correct configuration of your device. Then, you can print debug information by adding these lines to the beginning of the script. They will print every measurement result, where you can find useful error messages.
import logging
logging.getLogger('autotvm').setLevel(logging.DEBUG)
Scale up measurement by using multiple devices
If you have multiple devices, you can use all of them for measurement. TVM uses an RPC tracker to manage distributed devices. The RPC tracker is a centralized controller node, and we can register all devices to it. For example, if we have 10 GPU cards, we can register all of them to the tracker and run 10 measurements in parallel, accelerating the tuning process.
To start an RPC tracker, run this command on the host machine. The tracker is required during the whole tuning process, so we need to open a new terminal for this command:
python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
The expected output is:
INFO:RPCTracker:bind to 0.0.0.0:9190
Then open another new terminal for the RPC server. We need to start one dedicated server for each device. We use a string key to distinguish the types of devices; you can pick any name you like. (Note: for the rocm backend, there are some internal errors in the compiler; we need to add --no-fork to the argument list.)
python -m tvm.exec.rpc_server --tracker=127.0.0.1:9190 --key=1080ti
After registering devices, we can confirm the registration by querying the rpc_tracker:
python -m tvm.exec.query_rpc_tracker --host=127.0.0.1 --port=9190
For example, if we have four 1080ti, two titanx, and one gfx900, the output can be:
Queue Status
----------------------------------
key          total  free  pending
----------------------------------
1080ti       4      4     0
titanx       2      2     0
gfx900       1      1     0
----------------------------------
Finally, we need to change the tuning option to use RPCRunner. Use the code below to replace the corresponding part above.
tuning_option = {
    "log_filename": log_file,
    "tuner": "xgb",
    "n_trial": 2000,
    "early_stopping": 600,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.RPCRunner(
            "1080ti",  # change the device key to your own key
            "127.0.0.1",
            9190,
            number=20,
            repeat=3,
            timeout=4,
            min_repeat_ms=150,
        ),
    ),
}
9.8.2 Auto-Tuning a Convolutional Network for x86 CPUs
This is an example of how to tune a convolutional neural network for x86 CPUs.
Note that this example will not run on Windows or recent versions of macOS. To get it to run, you need to wrap the body in an if __name__ == "__main__": block.
import os
import numpy as np
import tvm
from tvm import relay, autotvm
from tvm.relay import testing
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.autotvm.graph_tuner import DPTuner, PBQPTuner
import tvm.contrib.graph_executor as runtime
Define the Network
First, we need to define the network in the relay frontend API. We can load some pre-defined networks from relay.testing, or build our own (e.g., with relay.testing.resnet). Models can also be loaded from MXNet, ONNX, and TensorFlow.
Here we choose resnet-18 as the tuning example.
def get_network(name, batch_size):
    """Get the symbol definition and random weights of a network."""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)
    if "resnet" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif "vgg" in name:
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.vgg.get_workload(
            num_layers=n_layer, batch_size=batch_size, dtype=dtype
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "squeezenet_v1.1":
        mod, params = relay.testing.squeezenet.get_workload(
            batch_size=batch_size, version="1.1", dtype=dtype
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    else:
        raise ValueError("Unsupported network: " + name)
    return mod, params, input_shape, output_shape
# Replace "llvm" below with the correct target of your CPU.
# For example, for an AWS EC2 c5 instance with Intel Xeon
# Platinum 8000 series, the target should be "llvm -mcpu=skylake-avx512".
# For an AWS EC2 c4 instance with Intel Xeon E5-2666 v3, it should be
# "llvm -mcpu=core-avx2".
target = "llvm"

batch_size = 1
dtype = "float32"
model_name = "resnet-18"
log_file = "%s.log" % model_name
graph_opt_sch_file = "%s_graph_opt.log" % model_name

# Set the input name of the graph.
# For ONNX models, it is typically "0".
input_name = "data"

# Set the number of threads used for tuning based on the number of
# physical CPU cores on your machine.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
Configure tensor tuning settings and create tasks
To achieve better kernel execution performance on x86 CPUs, we need to change the data layout of the convolution kernel from "NCHW" to "NCHWc". To deal with this situation, a conv2d_NCHWc operator is defined in topi. We will tune this operator instead of the plain conv2d.
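The layout change itself can be illustrated with NumPy. This is a sketch of the idea only; the helper nchw_to_nchwc and the block size of 4 are assumptions for illustration, not topi's implementation. The channel axis C is split into C//c blocks of c channels, and the small c axis is moved innermost so it can be vectorized:

```python
import numpy as np

def nchw_to_nchwc(x, c_block=4):
    # Split the channel axis into (C // c_block, c_block) and move the
    # small block innermost: NCHW -> NCHW[c_block]c.
    n, c, h, w = x.shape
    assert c % c_block == 0
    return x.reshape(n, c // c_block, c_block, h, w).transpose(0, 1, 3, 4, 2)

x = np.arange(2 * 8 * 3 * 3, dtype="float32").reshape(2, 8, 3, 3)
y = nchw_to_nchwc(x)
print(y.shape)  # (2, 2, 3, 3, 4)
```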
We will use local mode for tuning the configurations. The RPC tracker mode can be set up similarly to the approach used for auto-tuning convolutional networks on ARM CPUs.
For accurate measurement, we should repeat each measurement several times and use the average of the results. In addition, we need to flush the cache of the weight tensors between repeated measurements. This brings the measured latency of one operator closer to its actual latency during end-to-end inference.
tuning_option = {
    "log_filename": log_file,
    "tuner": "random",
    "early_stopping": None,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(),
        runner=autotvm.LocalRunner(
            number=1, repeat=10, min_repeat_ms=0, enable_cpu_cache_flush=True
        ),
    ),
}
# You can skip the implementation of this function for this tutorial.
def tune_kernels(
    tasks, measure_option, tuner="gridsearch", early_stopping=None, log_filename="tuning.log"
):
    for i, task in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # create tuner
        if tuner == "xgb":
            tuner_obj = XGBTuner(task, loss_type="reg")
        elif tuner == "xgb_knob":
            tuner_obj = XGBTuner(task, loss_type="reg", feature_type="knob")
        elif tuner == "xgb_itervar":
            tuner_obj = XGBTuner(task, loss_type="reg", feature_type="itervar")
        elif tuner == "xgb_curve":
            tuner_obj = XGBTuner(task, loss_type="reg", feature_type="curve")
        elif tuner == "xgb_rank":
            tuner_obj = XGBTuner(task, loss_type="rank")
        elif tuner == "xgb_rank_knob":
            tuner_obj = XGBTuner(task, loss_type="rank", feature_type="knob")
        elif tuner == "xgb_rank_itervar":
            tuner_obj = XGBTuner(task, loss_type="rank", feature_type="itervar")
        elif tuner == "xgb_rank_curve":
            tuner_obj = XGBTuner(task, loss_type="rank", feature_type="curve")
        elif tuner == "xgb_rank_binary":
            tuner_obj = XGBTuner(task, loss_type="rank-binary")
        elif tuner == "xgb_rank_binary_knob":
            tuner_obj = XGBTuner(task, loss_type="rank-binary", feature_type="knob")
        elif tuner == "xgb_rank_binary_itervar":
            tuner_obj = XGBTuner(task, loss_type="rank-binary", feature_type="itervar")
        elif tuner == "xgb_rank_binary_curve":
            tuner_obj = XGBTuner(task, loss_type="rank-binary", feature_type="curve")
        elif tuner == "ga":
            tuner_obj = GATuner(task, pop_size=50)
        elif tuner == "random":
            tuner_obj = RandomTuner(task)
        elif tuner == "gridsearch":
            tuner_obj = GridSearchTuner(task)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # do tuning
        n_trial = len(task.config_space)
        tuner_obj.tune(
            n_trial=n_trial,
            early_stopping=early_stopping,
            measure_option=measure_option,
            callbacks=[
                autotvm.callback.progress_bar(n_trial, prefix=prefix),
                autotvm.callback.log_to_file(log_filename),
            ],
        )
# Use the graph tuner to achieve the optimal schedule at the graph level.
# Set use_DP=False if it takes too long to finish.
def tune_graph(graph, dshape, records, opt_sch_file, use_DP=True):
    target_op = [
        relay.op.get("nn.conv2d"),
    ]
    Tuner = DPTuner if use_DP else PBQPTuner
    executor = Tuner(graph, {input_name: dshape}, records, target_op, target)
    executor.benchmark_layout_transform(min_exec_num=2000)
    executor.run()
    executor.write_opt_sch2record_file(opt_sch_file)
Finally, we launch the tuning jobs and evaluate the end-to-end performance.
def evaluate_performance(lib, data_shape):
    # upload parameters to the device
    dev = tvm.cpu()
    data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
    module = runtime.GraphModule(lib["default"](dev))
    module.set_input(input_name, data_tvm)

    # evaluate
    print("Evaluate inference time cost...")
    print(module.benchmark(dev, number=100, repeat=3))


def tune_and_evaluate(tuning_opt):
    # extract workloads from the relay program
    print("Extract tasks...")
    mod, params, data_shape, out_shape = get_network(model_name, batch_size)
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params, ops=(relay.op.get("nn.conv2d"),)
    )

    # run tuning tasks
    tune_kernels(tasks, **tuning_opt)
    tune_graph(mod["main"], data_shape, log_file, graph_opt_sch_file)

    # compile kernels in default mode
    print("Evaluation of the network compiled in default mode without auto-tuning:")
    with tvm.transform.PassContext(opt_level=3):
        print("Compile...")
        lib = relay.build(mod, target=target, params=params)
        evaluate_performance(lib, data_shape)

    # compile kernels with kernel-level best records
    print("\nEvaluation of the network tuned at the kernel level:")
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)
        evaluate_performance(lib, data_shape)

    # compile kernels with graph-level best records
    print("\nEvaluation of the network tuned at the graph level:")
    with autotvm.apply_graph_best(graph_opt_sch_file):
        print("Compile...")
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build_module.build(mod, target=target, params=params)
        evaluate_performance(lib, data_shape)
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
Sample Output
The tuning needs to compile many programs and extract features from them, so a high-performance CPU is recommended. One sample output is listed below.
Extract tasks...
Tuning...
[Task 1/12] Current/Best: 598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
[Task 2/12] Current/Best: 522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
[Task 3/12] Current/Best: 447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
[Task 4/12] Current/Best: 481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
[Task 5/12] Current/Best: 414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
[Task 6/12] Current/Best: 508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
[Task 7/12] Current/Best: 469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
[Task 8/12] Current/Best: 230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
[Task 9/12] Current/Best: 487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
[Task 10/12] Current/Best: 182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
[Task 11/12] Current/Best: 372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
[Task 12/12] Current/Best: 215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
INFO Start to benchmark layout transformation...
INFO Benchmarking layout transformation successful.
INFO Start to run dynamic programming algorithm...
INFO Start forward pass...
INFO Finished forward pass.
INFO Start backward pass...
INFO Finished backward pass...
INFO Finished DPExecutor run.
INFO Writing optimal schedules to resnet-18_graph_opt.log successfully.
Evaluation of the network compiled in default mode without auto-tuning:
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 4.5 ms (0.03 ms)

Evaluation of the network tuned at the kernel level:
Compile...
Evaluate inference time cost...
Mean inference time (std dev): 3.2 ms (0.03 ms)

Evaluation of the network tuned at the graph level:
Compile...
Config for target=llvm -keys=cpu, workload=('dense_nopack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Config for target=llvm -keys=cpu, workload=('dense_pack.x86', ('TENSOR', (1, 512), 'float32'), ('TENSOR', (1000, 512), 'float32'), None, 'float32') is missing in ApplyGraphBest context. A fallback configuration is used, which may bring great performance regression.
Evaluate inference time cost...
Mean inference time (std dev): 3.16 ms (0.03 ms)