After starting the Ollama service, how do I find out the actual output tokens/s of a model I'm calling? A detailed walkthrough!
First, let's open Ollama and see which models are currently installed:
E:\>ollama list
NAME                ID              SIZE      MODIFIED
qwen3:32b           030ee887880f    20 GB     10 hours ago
llama3.2:latest     a80c4f17acd5    2.0 GB    5 days ago
deepseek-r1:1.5b    a42b25d8c10a    1.1 GB    2 weeks ago
We will use the qwen3:32b model as the example throughout.
Method 1: Run the model with the Ollama CLI and check the performance metrics
- Start the model:
- Use the ollama run command to start the model you want to test, for example:
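ollama run qwen3:32b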
- Enable verbose mode:
- If you want detailed statistics, add the --verbose flag when starting the model:
ollama run qwen3:32b --verbose
This exposes additional metrics such as prompt_eval_duration and eval_rate, but eval_count and eval_duration remain the key values for computing tokens/s.
E:\>ollama run qwen3:32b --verbose
>>> Hello, I'm doing great! How about you?
<think>
Okay, the user said "Hello, I'm doing great! How about you?" So I need to respond to their greeting first. They're asking how I'm doing, so I should acknowledge that. But I need to remember that as an AI, I don't have feelings in the traditional sense. I should be friendly and positive while being honest about my nature.

The conversation seems to be starting off on a positive note. I should match their cheerful tone with something equally upbeat. Maybe something about being ready to help or excited to chat? I can mention that while I don't experience emotions, I'm here and available to assist them with whatever they need.

I should also consider how to make the conversation flow naturally. After acknowledging their greeting, perhaps ask them what they'd like to talk about or how I can help them today. Keeping it open-ended gives them options while showing I'm attentive to their needs.

I need to make sure my response is concise but not too short. A simple "I'm good, thanks!" might be too brief, while a long explanation could be overwhelming. Finding that balance is key. Maybe add an emoji to keep it friendly and approachable, but not overdo it.

Also, since they mentioned they're doing great, I can acknowledge their positivity and maybe reflect that back to them. It builds rapport and shows I'm listening. Something like "That's wonderful to hear! I'm here and ready to help. What's on your mind today?" That covers gratitude, availability, and opens the door for them to share more.

I should avoid any technical jargon or complex language. Keep it simple and conversational. Make sure the response is warm and welcoming. It's important to establish a friendly, supportive tone right from the start.

Also, remember to follow all guidelines regarding appropriate content and helpful responses. No need to mention those explicitly, just ensure the response is in line with that.
</think>

Hi there! Thank you for the cheerful greeting! While I don't experience feelings quite like humans do, I'm always excited to chat and help out. I'm here and ready to assist you with whatever's on your mind, whether it's a question, a creative idea, or just someone to bounce thoughts off of. What's up?

total duration:       1m31.2075255s
load duration:        30.2214ms
prompt eval count:    19 token(s)
prompt eval duration: 595.6157ms
prompt eval rate:     31.90 tokens/s
eval count:           469 token(s)
eval duration:        1m30.5816884s
eval rate:            5.18 tokens/s
The important part of the output is the block of statistics at the end. Here is a detailed explanation of each field:
- total duration: 1m31.2075255s (91.2075255 seconds)
- Meaning: the total time from the start of the request to its completion, covering every stage: model loading, prompt evaluation, and response generation.
- Analysis: a total of about 1 minute 31 seconds, so this was a relatively expensive run. The log shows the model doing detailed reasoning in the <think> stage (roughly a minute), which significantly lengthened the total. The actual generation time (eval duration) dominates, while prompt processing and loading are short.
- load duration: 30.2214ms (0.0302214 seconds)
- Meaning: the time spent loading the model into memory, in milliseconds.
- Analysis: only about 30 ms, which suggests the model was already resident in memory, i.e. the initial load cost of qwen3:32b (20 GB) was paid in an earlier run. This stage is negligible here.
- prompt eval count: 19 token(s)
- Meaning: the number of tokens in the input prompt, i.e. how many units "Hello, I'm doing great! How about you?" was split into.
- Analysis: the prompt was tokenized into 19 tokens, consistent with the length of the input and the way the tokenizer handles spaces and punctuation. A longer prompt would increase the cost of this initial stage.
- prompt eval duration: 595.6157ms (0.5956157 seconds)
- Meaning: the time spent processing the input prompt, in milliseconds.
- Analysis: about 596 ms to process 19 tokens, which reflects the compute cost of the input (prefill) stage; it may also include some preprocessing done before the <think> reasoning begins.
- prompt eval rate: 31.90 tokens/s
- Meaning: prompt tokens processed per second, computed as prompt eval count / (prompt eval duration / 1000) = 19 / (595.6157 / 1000) ≈ 31.90 tokens/s.
- Analysis: prompt processing is fairly efficient (31.90 tokens/s), which suggests the input stage is well optimized; it also accounts for only a tiny fraction of the total time (< 1 second).
- eval count: 469 token(s)
- Meaning: the number of tokens the model generated, i.e. the token count of the output ("Hi there! Thank you for the cheerful greeting! ..." and everything around it).
- Analysis: 469 tokens were generated; as the log shows, the output is long because it includes both the <think> block and the final reply. The model spent a lot of tokens on reasoning and elaboration.
- eval duration: 1m30.5816884s (90.5816884 seconds)
- Meaning: the time spent generating the response, in seconds.
- Analysis: about 1 minute 30 seconds went into generation, which accounts for almost all of the total (91.2075255 s - 0.5956157 s - 0.0302214 s ≈ 90.58 s). The detailed reasoning in the <think> stage (roughly a minute) is the main cost and reflects the compute-intensive planning the model performs before answering.
- eval rate: 5.18 tokens/s
- Meaning: tokens generated per second, computed as eval count / eval duration = 469 / 90.5816884 ≈ 5.18 tokens/s.
- Analysis: the generation rate is low (5.18 tokens/s). qwen3:32b is a large model (20 GB), and this run combined generation with heavy <think> reasoning; the rate also depends on the hardware (possibly CPU-only inference) and on how well the model is optimized or quantized.
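As a quick sanity check, both rates can be recomputed from the raw counts and durations. A minimal Python sketch, with the values copied from the verbose output above:

# Values copied from the `ollama run qwen3:32b --verbose` output above
prompt_eval_count = 19            # prompt tokens
prompt_eval_duration_s = 0.5956157
eval_count = 469                  # generated tokens
eval_duration_s = 90.5816884

print(f"prompt eval rate: {prompt_eval_count / prompt_eval_duration_s:.2f} tokens/s")  # ≈ 31.90
print(f"eval rate:        {eval_count / eval_duration_s:.2f} tokens/s")                # ≈ 5.18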
Additional notes: overall analysis
- Breakdown of the run:
- Loading (30 ms) is negligible.
- Prompt processing (596 ms, 19 tokens) completes quickly.
- Generation (90.58 s, 469 tokens) dominates; the reasoning in the <think> stage significantly lengthens it.
- Performance bottleneck: the generation rate (5.18 tokens/s) is low, most likely because of qwen3:32b's size and the extra compute of <think> mode. Using a GPU or a smaller model (such as deepseek-r1:1.5b) should improve throughput; see the comparison command below.
- Effect of the content: the log shows the model reasoning in depth (matching the user's tone, avoiding jargon, and so on), which increases the token count and the duration but improves the quality of the reply.
Taken together, these metrics suggest that with the current setup qwen3:32b is better suited to scenarios that prioritize quality over speed. To speed things up, consider reducing the <think> effort or switching to a lighter model.
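For a quick comparison, you can repeat the same verbose run on one of the lighter models from the list above, for example:
ollama run deepseek-r1:1.5b --verbose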
Method 2: Use the Ollama API and parse the response
- Start the Ollama service:
- Make sure the Ollama service is running:
ollama serve
- By default it listens on http://localhost:11434.
- Send an API request:
- Use curl or another HTTP client to send a request, for example:
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:32b",
  "prompt": "Hello, how are you?",
  "stream": false
}'
- Set "stream": false to receive one complete response that includes the performance data.
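Note that the single-quoted JSON above assumes a Unix-style shell. In the classic Windows cmd prompt shown earlier (E:\>), single quotes are not treated as quoting, so you would typically escape the double quotes instead, roughly:
curl http://localhost:11434/api/generate -d "{\"model\": \"qwen3:32b\", \"prompt\": \"Hello, how are you?\", \"stream\": false}"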
- Parse the response:
- The response contains eval_count and eval_duration, for example:
{ "model": "qwen3:32b", "created_at": "2025-06-13T08:13:00Z", "response": "Hello, I'm doing great! How about you?", "done": true, "eval_count": 12, "eval_duration": 600000000 }
Compute tokens/s (eval_duration is reported in nanoseconds):
tokens/s = eval_count / (eval_duration / 1e9) = 12 / (600000000 / 1000000000) = 12 / 0.6 = 20 tokens/s
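The same calculation in a few lines of Python, a minimal sketch using the example response above:

import json

# Example /api/generate response body (abbreviated to the two fields we need)
resp = json.loads('{"eval_count": 12, "eval_duration": 600000000}')

tokens_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)  # eval_duration is in nanoseconds
print(f"{tokens_per_s:.2f} tokens/s")  # 20.00 tokens/s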
Notes
- Hardware: tokens/s depends on the hardware (CPU/GPU), the model size, and the quantization level. For example, qwen3:32b (20 GB) needs far more compute than deepseek-r1:1.5b (1.1 GB).
- Prompt length: longer prompts increase prompt_eval_duration but have little effect on eval_duration (the generation stage).
- Environment: make sure the system has enough memory and, if you want GPU acceleration, working GPU support.
Recommended practice
- Test several prompts and average the tokens/s values for a more reliable figure.
- To automate the measurement, write a script that calls the API and parses the response; see the sketch below.
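A minimal automation sketch along those lines, using only the Python standard library and assuming the service is reachable at http://localhost:11434; the model name and the prompt list are illustrative:

import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen3:32b"                      # any model shown by `ollama list`
PROMPTS = [                              # illustrative test prompts
    "Hello, how are you?",
    "Explain what a token is in one sentence.",
    "List three uses of a large language model.",
]

def measure(prompt: str) -> float:
    """Send one non-streaming generate request and return generation tokens/s."""
    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # eval_duration is reported in nanoseconds
    return body["eval_count"] / (body["eval_duration"] / 1e9)

rates = [measure(p) for p in PROMPTS]
for prompt, rate in zip(PROMPTS, rates):
    print(f"{rate:6.2f} tokens/s  <-  {prompt!r}")
print(f"average: {sum(rates) / len(rates):.2f} tokens/s")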
With the methods above, you can accurately measure the output tokens/s of qwen3:32b, llama3.2:latest, or deepseek-r1:1.5b.