Deploying a Large Model Service with vLLM - Qwen3-32B as an Example


Deployment Environment

  • OS: Ubuntu 20.04.4 LTS
  • CUDA version: 12.8
  • CUDA compilation tools (nvcc): release 12.2, V12.2.91
  • GPUs: A100 40GB × 2
  • Miniconda3
  • vllm==0.8.5
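
As a reference, a minimal environment setup sketch, assuming a fresh Miniconda install; the environment name and Python version are illustrative:

# Create and activate an isolated conda environment
conda create -n vllm-qwen3 python=3.10 -y
conda activate vllm-qwen3
# Install the pinned vLLM release used in this post
pip install vllm==0.8.5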

Server Arguments

Documentation: Server Arguments — vLLM
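
The full list of supported flags can also be printed locally:

# Print every server argument supported by the installed vLLM version
vllm serve --help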

Write the YAML configuration file

# Qwen3-32B.yaml

host: 0.0.0.0
port: 8000
api-key: Qwen  # clients must send this value as the Bearer token
model: /root/models/Qwen3-32B/
enable-lora: true
lora-modules: '{"name": "TCM", "path": "/root/output/TCM/Qwen3-32B/v0-20250520-171040/checkpoint-280/", "base_model_name": "Qwen3-32B"}'
max-model-len: 8192
served-model-name: Qwen3-32B
tensor-parallel-size: 2  # shard the model across the two A100s
gpu-memory-utilization: 0.9
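
For reference, the same options can be passed directly as command-line flags instead of a --config file; a sketch of the equivalent invocation (flag names mirror the YAML keys):

vllm serve /root/models/Qwen3-32B/ \
  --host 0.0.0.0 --port 8000 \
  --api-key Qwen \
  --enable-lora \
  --lora-modules '{"name": "TCM", "path": "/root/output/TCM/Qwen3-32B/v0-20250520-171040/checkpoint-280/", "base_model_name": "Qwen3-32B"}' \
  --max-model-len 8192 \
  --served-model-name Qwen3-32B \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9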

Start the Service

CUDA_VISIBLE_DEVICES=4,5 vllm serve --config /root/vllm_config/Qwen3-32B.yaml
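
Once the server is up, a quick sanity check is to query the OpenAI-compatible model list with the configured API key; the response should list Qwen3-32B along with the TCM LoRA adapter:

# List the models exposed by the server
curl http://127.0.0.1:8000/v1/models -H "Authorization: Bearer Qwen"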

Client Testing

# Non-streaming + thinking
import json
from openai import OpenAI

prompt = "你好"
model = "Qwen3-32B"
client = OpenAI(
    api_key='Qwen',  # must match the api-key configured on the server
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=False
    )
    response = json.loads(completion.model_dump_json())
    print(response['choices'][0]['message']['content'])

except Exception as e:
    print(f"API调用出错: {e}")
# Non-streaming + no thinking
import json
from openai import OpenAI

prompt = "\\no_think 你好"
model = "Qwen3-32B"
client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=False
    )
    response = json.loads(completion.model_dump_json())
    print(response['choices'][0]['message']['content'])

except Exception as e:
    print(f"API调用出错: {e}")
# Streaming + thinking
import json
from openai import OpenAI

prompt = "介绍一下中医。"
model = "Qwen3-32B"
client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=True
    )
    for chunk in completion:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)

except Exception as e:
    print(f"API调用出错: {e}")

Throughput

throughput: 31.7 tokens/s
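
For a rough client-side estimate of this number, one can time a single non-streaming request and divide the generated token count (from the usage field) by the elapsed wall-clock time; a sketch measuring single-request decode speed, not batched serving throughput:

# Rough single-request throughput estimate
import time
from openai import OpenAI

client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

start = time.time()
completion = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "介绍一下中医。"}],
    max_completion_tokens=1024,
    temperature=0.7,
    stream=False
)
elapsed = time.time() - start
generated = completion.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.1f} tokens/s")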
