vLLM in Practice: Fixing Slow TTFT Responses
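Larger token and sequence batch limits → the model processes more tokens in a single forward pass
Efficient GPU kernels / tensor parallelism → each request's waiting time is amortized across the batch
As a result, the median TTFT of an individual request goes down rather than up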
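A minimal sketch of the amortization argument, under loud assumptions: every prompt is admitted whole (no chunked prefill), decode tokens are ignored, and all requests are already waiting in the queue. The function `prefill_steps` is hypothetical and is not vLLM's scheduler; it only counts how many prefill steps a backlog needs under the two limits.

```python
import math

def prefill_steps(num_requests: int, prompt_len: int,
                  max_num_seqs: int, max_num_batched_tokens: int) -> int:
    """Toy count of scheduler steps needed to prefill a waiting backlog.

    Assumptions (not vLLM's real scheduler): prompts are admitted whole,
    decode tokens are ignored, and all requests are already queued.
    """
    # Prompts per step are limited by both the token budget and the sequence cap.
    per_step = min(max_num_seqs, max_num_batched_tokens // prompt_len)
    if per_step == 0:
        raise ValueError("token budget is smaller than a single prompt")
    return math.ceil(num_requests / per_step)

# 50 queued requests with 200-token prompts, as in the benchmark below.
conservative = prefill_steps(50, 200, max_num_seqs=16, max_num_batched_tokens=2048)
aggressive = prefill_steps(50, 200, max_num_seqs=64, max_num_batched_tokens=6144)
print(conservative, aggressive)
```

In this toy model the last request in the queue waits through more steps under the small limits than under the large ones. Each larger step also costs more compute, so the gain holds only while the GPU has spare capacity for the bigger batch.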
📊 Benchmark test
max-num-seqs = 64
max-num-batched-tokens = 6144
request rate = 20 req/s
──────────────────────────────────────────────────────────────────────
Test length: input=200, output=100
Number of requests: 50, timeout: 600 s
Total context length: ~300 tokens (about 14.6% of the context window)
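The settings above could map onto command lines roughly as follows. This is a sketch, assuming a recent vLLM release where the benchmark client is exposed as `vllm bench serve` (older releases ship it as `benchmarks/benchmark_serving.py`) and a random dataset. `MODEL` is a hypothetical placeholder because the source does not name the model, and the 600 s timeout is not represented here.

```python
import shlex

MODEL = "your-model-name"  # hypothetical placeholder; the source does not name the model

# Server side: the two scheduler limits under test.
serve_cmd = [
    "vllm", "serve", MODEL,
    "--max-num-seqs", "64",
    "--max-num-batched-tokens", "6144",
]

# Client side: assumed to be vLLM's serving benchmark with random prompts.
bench_cmd = [
    "vllm", "bench", "serve",
    "--model", MODEL,
    "--dataset-name", "random",
    "--random-input-len", "200",
    "--random-output-len", "100",
    "--num-prompts", "50",
    "--request-rate", "20",
]

# Print shell-safe command lines instead of executing them.
print(shlex.join(serve_cmd))
print(shlex.join(bench_cmd))
```

Building the commands as lists keeps each sweep variant (different `--max-num-seqs`, `--max-num-batched-tokens`, `--request-rate`) a one-line change.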
============ Serving Benchmark Result ============
Successful requests: 50
Failed requests: 0
Request rate configured (RPS): 20.00
Benchmark duration (s): 15.41
Total input tokens: 10000
Total generated tokens: 5000
Request throughput (req/s): 3.24
Output token throughput (tok/s): 324.45
Peak output token throughput (tok/s): 450.00
Peak concurrent requests: 50.00
Total token throughput (tok/s): 973.34
---------------Time to First Token----------------
Mean TTFT (ms): 381.40
Median TTFT (ms): 376.31
P99 TTFT (ms): 581.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 127.90
Median TPOT (ms): 128.27
P99 TPOT (ms): 134.23
---------------Inter-token Latency----------------
Mean ITL (ms): 127.90
Median ITL (ms): 120.87
P99 ITL (ms): 296.96
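The throughput figures in this report can be cross-checked from the raw counts. The sketch below uses only numbers copied from the output above; the per-request latency estimate (first token plus the remaining tokens at the mean TPOT) is a rough approximation, not something the benchmark reports.

```python
# Figures copied from the benchmark output above.
duration_s = 15.41
num_requests = 50
input_tokens = 10000
output_tokens = 5000
output_len = 100          # tokens generated per request
mean_ttft_ms = 381.40
mean_tpot_ms = 127.90

request_throughput = num_requests / duration_s                  # req/s
output_throughput = output_tokens / duration_s                  # tok/s
total_throughput = (input_tokens + output_tokens) / duration_s  # tok/s

# Rough per-request latency: first token, then (output_len - 1) further tokens.
est_latency_s = (mean_ttft_ms + (output_len - 1) * mean_tpot_ms) / 1000

print(f"{request_throughput:.2f} req/s, {output_throughput:.2f} tok/s, "
      f"{total_throughput:.2f} total tok/s, ~{est_latency_s:.1f} s per request")
```

Small differences from the reported values come from the rounded duration. The estimate also suggests that most of the 15.41 s run is one request's generation time, so with only 50 requests the measured 3.24 req/s reflects request latency more than the configured 20 req/s arrival rate.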
======================================================================
📊 Full performance comparison report
======================================================================
Rank  max_seqs  max_btk   req_rate  Throughput  Median TTFT  Requests/s  Config name
                (tokens)  (req/s)   (tok/s)     (ms)         (req/s)
--------------------------------------------------------------------------------------------------------------
🥇    64        6144      20        324.45      376.31       3.24        Plan 7: extremely aggressive (64 seqs, 6K token
🥈    96        7168      20        317.44      372.66       3.17        Plan 9: maximum throughput (96 seqs, 7K token
🥉    128       8192      15        305.59      296.29       3.06        Plan 10: aggressive boundary (128 seqs, 8K tok
4     80        7168      20        277.16      381.20       2.77        Plan 8: ultra aggressive (80 seqs, 7K tokens
5     48        5120      25        205.07      456.34       2.05        Plan 6: safe limit (48 seqs, 5K token
6     48        4096      30        204.67      560.14       2.05        Plan 4: high-concurrency stress test (48 seqs, 4K toke
7     32        5120      25        202.23      448.84       2.02        Plan 5: aggressive throughput (32 seqs, 5K token
8     32        4096      25        201.97      450.60       2.02        Plan 3: balanced, recommended (32 seqs, 4K token
9     24        3072      20        143.48      10661.33     1.43        Plan 2: stable throughput (24 seqs, 3K token
10    16        2048      15        109.54      10512.33     1.10        Plan 1: conservative baseline (16 seqs, 2K token
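To read the table programmatically, the rows can be loaded into tuples and ranked. This is a small sketch using only the numbers from the table; the tuple layout and variable names are illustrative.

```python
# (max_num_seqs, max_num_batched_tokens, request_rate, output tok/s, median TTFT ms)
rows = [
    (64, 6144, 20, 324.45, 376.31),
    (96, 7168, 20, 317.44, 372.66),
    (128, 8192, 15, 305.59, 296.29),
    (80, 7168, 20, 277.16, 381.20),
    (48, 5120, 25, 205.07, 456.34),
    (48, 4096, 30, 204.67, 560.14),
    (32, 5120, 25, 202.23, 448.84),
    (32, 4096, 25, 201.97, 450.60),
    (24, 3072, 20, 143.48, 10661.33),
    (16, 2048, 15, 109.54, 10512.33),
]

best_throughput = max(rows, key=lambda r: r[3])  # highest output tok/s
lowest_ttft = min(rows, key=lambda r: r[4])      # lowest median TTFT

# Ratio between the worst and the best median TTFT in the sweep.
ttft_ratio = max(r[4] for r in rows) / lowest_ttft[4]

print("best throughput:", best_throughput[:2])
print("lowest median TTFT:", lowest_ttft[:2])
print(f"worst/best median TTFT ratio: {ttft_ratio:.1f}x")
```

The runs use different request rates (15 to 30 req/s), so the ranking compares configurations under unequal load and should be read as indicative rather than strictly like-for-like.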
