Deploying a Large-Model Service with vLLM - Using Qwen3-32B as an Example
Deployment Environment
- OS: Ubuntu 20.04.4 LTS
- CUDA version: 12.8
- CUDA compilation tools: release 12.2, V12.2.91
- GPUs: 2 × A100 40GB
- Miniconda3
- vllm==0.8.5
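As an optional sanity check of this environment, the following minimal Python sketch prints the installed vLLM version and the GPUs visible from inside the conda environment; it assumes PyTorch was installed alongside vLLM.

import torch
import vllm

# Confirm the vLLM build and the GPUs that PyTorch can see.
print("vllm:", vllm.__version__)
print("cuda available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")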
Service Parameters
Write the server options into a YAML configuration file:
# Qwen3-32B.yaml
host: 0.0.0.0
port: 8000
api-key: Qwen
model: /root/models/Qwen3-32B/
enable-lora: true
lora-modules: '{"name": "TCM", "path": "/root/output/TCM/Qwen3-32B/v0-20250520-171040/checkpoint-280/", "base_model_name": "Qwen3-32B"}'
max-model-len: 8192
served-model-name: Qwen3-32B
tensor-parallel-size: 2
gpu-memory-utilization: 0.9
Start the Service
CUDA_VISIBLE_DEVICES=4,5 vllm serve --config /root/vllm_config/Qwen3-32B.yaml
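Once the server is up, a quick way to confirm it loaded correctly is to list the models exposed by the OpenAI-compatible /v1/models endpoint; with enable-lora set, the LoRA adapter registered as TCM should appear alongside Qwen3-32B. A minimal sketch, assuming the address and API key from the config above:

from openai import OpenAI

client = OpenAI(api_key='Qwen', base_url='http://127.0.0.1:8000/v1')

# List the models the server exposes; the base model ("Qwen3-32B")
# and the LoRA adapter ("TCM") should both be present.
for m in client.models.list().data:
    print(m.id)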
Client Tests
# Non-streaming + thinking enabled
import json
from openai import OpenAI

prompt = "你好"  # "Hello"
model = "Qwen3-32B"

client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=False
    )
    response = json.loads(completion.model_dump_json())
    print(response['choices'][0]['message']['content'])
except Exception as e:
    print(f"API call failed: {e}")
# Non-streaming + thinking disabled (Qwen3 "/no_think" soft switch)
import json
from openai import OpenAI

prompt = "/no_think 你好"  # "/no_think" prefix asks Qwen3 to skip the thinking phase
model = "Qwen3-32B"

client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=False
    )
    response = json.loads(completion.model_dump_json())
    print(response['choices'][0]['message']['content'])
except Exception as e:
    print(f"API call failed: {e}")
# Streaming + thinking enabled
from openai import OpenAI

prompt = "介绍一下中医。"  # "Give an introduction to Traditional Chinese Medicine."
model = "Qwen3-32B"

client = OpenAI(
    api_key='Qwen',
    base_url='http://127.0.0.1:8000/v1',
)

try:
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_completion_tokens=4096,
        temperature=0.7,
        stream=True
    )
    for chunk in completion:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
except Exception as e:
    print(f"API call failed: {e}")
Throughput
Observed throughput: 31.7 tokens/s
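The measurement setup behind this figure is not recorded here; a simple way to reproduce a comparable single-request number is to time one non-streaming call and divide the generated token count (from the usage field vLLM returns) by the elapsed time. A minimal sketch:

import time
from openai import OpenAI

client = OpenAI(api_key='Qwen', base_url='http://127.0.0.1:8000/v1')

start = time.perf_counter()
completion = client.chat.completions.create(
    model="Qwen3-32B",
    messages=[{"role": "user", "content": "介绍一下中医。"}],
    max_completion_tokens=4096,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

# usage.completion_tokens counts only generated tokens, so this gives a
# single-request generation throughput, not a batched benchmark.
tokens = completion.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tokens/s")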