vLLM in Practice: Fixing Slow TTFT


    Larger batches (more tokens + more seqs) → the model computes more tokens per step
    GPU kernels / tensor parallelism run at high efficiency → each request's wait time is amortized
    As a result, the median TTFT of a single request actually goes down
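
The points above correspond directly to two vLLM launch flags. A minimal launch sketch using the most aggressive settings tested below; the model name and port are illustrative placeholders, not from the post:

```shell
# Start vLLM with an aggressive batching budget (64 seqs, 6144 tokens).
# Model and port are placeholders.
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-num-seqs 64 \
  --max-num-batched-tokens 6144 \
  --port 8000
```

`--max-num-batched-tokens` caps how many tokens one scheduler step may process, and `--max-num-seqs` caps how many requests share that step; raising both lets prefill work from many requests ride in a single batch, which is what amortizes per-request TTFT.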

    
    📊 Benchmark test
       max-num-seqs = 64
       max-num-batched-tokens = 6144
       request rate = 20 req/s
    ──────────────────────────────────────────────────────────────────────
       Test lengths: input=200, output=100
       Requests: 50, timeout: 600 s
       Total context per request: ~300 tokens (~14.6% of the context window)
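
In recent vLLM versions, a run with these settings can be reproduced with the bundled benchmark client (`vllm bench serve`). A hedged sketch, assuming a vLLM server is already listening locally; the model name is a placeholder:

```shell
# Random-prompt benchmark matching the settings above:
# 50 requests, 200 input / 100 output tokens, offered at 20 req/s.
vllm bench serve \
  --model Qwen/Qwen2.5-7B-Instruct \
  --dataset-name random \
  --random-input-len 200 \
  --random-output-len 100 \
  --num-prompts 50 \
  --request-rate 20
```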
    
    
    
    
    ============ Serving Benchmark Result ============
    Successful requests:                     50
    Failed requests:                         0
    Request rate configured (RPS):           20.00
    Benchmark duration (s):                  15.41
    Total input tokens:                      10000
    Total generated tokens:                  5000
    Request throughput (req/s):              3.24
    Output token throughput (tok/s):         324.45
    Peak output token throughput (tok/s):    450.00
    Peak concurrent requests:                50.00
    Total token throughput (tok/s):          973.34
    ---------------Time to First Token----------------
    Mean TTFT (ms):                          381.40
    Median TTFT (ms):                        376.31
    P99 TTFT (ms):                           581.25
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms):                          127.90
    Median TPOT (ms):                        128.27
    P99 TPOT (ms):                           134.23
    ---------------Inter-token Latency----------------
    Mean ITL (ms):                           127.90
    Median ITL (ms):                         120.87
    P99 ITL (ms):                            296.96
    
    
    ======================================================================
    📊 Full Performance Comparison Report
    ======================================================================
    
    Rank  max_seqs   max_btk      req_rate   Output tput  Median TTFT  Req/s      Config name
                     (tokens)     (req/s)    (tok/s)      (ms)         (req/s)
    --------------------------------------------------------------------------------------------------------------
    🥇    64         6144         20         324.45       376.31       3.24       Plan 7: Extreme aggressive (64 seqs, 6K tokens)
    🥈    96         7168         20         317.44       372.66       3.17       Plan 9: Extreme throughput (96 seqs, 7K tokens)
    🥉    128        8192         15         305.59       296.29       3.06       Plan 10: Aggressive boundary (128 seqs, 8K tokens)
     4   80         7168         20         277.16       381.20       2.77       Plan 8: Ultra aggressive (80 seqs, 7K tokens)
     5   48         5120         25         205.07       456.34       2.05       Plan 6: Extreme safe (48 seqs, 5K tokens)
     6   48         4096         30         204.67       560.14       2.05       Plan 4: High-concurrency stress (48 seqs, 4K tokens)
     7   32         5120         25         202.23       448.84       2.02       Plan 5: Aggressive throughput (32 seqs, 5K tokens)
     8   32         4096         25         201.97       450.60       2.02       Plan 3: Balanced recommended (32 seqs, 4K tokens)
     9   24         3072         20         143.48       10661.33     1.43       Plan 2: Stable throughput (24 seqs, 3K tokens)
    10   16         2048         15         109.54       10512.33     1.10       Plan 1: Conservative baseline (16 seqs, 2K tokens)
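
The ~10.5 s median TTFT for Plans 1 and 2 is a queueing effect: their token budgets cannot serve the offered load, so requests pile up waiting for a scheduler slot. A back-of-envelope check using the benchmark's fixed request shape (200 input + 100 output tokens per request):

```shell
# Offered load for Plan 1: 15 req/s, ~300 tokens per request.
req_rate=15; input_len=200; output_len=100
demand=$(( req_rate * (input_len + output_len) ))
echo "offered load: ${demand} tok/s"
# Plan 1's measured total throughput was only ~330 tok/s
# (1.10 req/s x 300 tokens), far below the offered load,
# so the queue grows without bound and TTFT explodes.
```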
    
    posted @ 2026-01-21 17:57  向着朝阳