vllm-ascend

  We had been using the MindIE framework for large-model inference; vLLM can now export monitoring metrics to Prometheus/Grafana, and it supports Ascend as well.
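For reference: vLLM's OpenAI-compatible server exposes Prometheus-format metrics at /metrics on the same HTTP port, so a Prometheus scrape job can point straight at that endpoint. Once the container from step 2 below is up (host port 18034 in this setup), a quick check looks like:

# Fetch vLLM's Prometheus metrics (request counts, token throughput, KV-cache usage, ...)
curl -s http://localhost:18034/metrics | head -n 20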

1. Image: getting hold of the image was not easy; I finally pulled it from the 渡渡鸟 (Dudu Bird) mirror: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64
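For reference, pulling the mirrored image is a one-liner (the v0.9.2rc1-linuxarm64 tag is the arm64 build synced from quay.io/ascend/vllm-ascend):

# Pull the arm64 vllm-ascend image from the SWR mirror
docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64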

2. Launch: adapted from the command on the official site; the weight files are mapped to /model/QwQ-32B, running tensor-parallel across 2 NPUs. Verification steps follow the command below.

docker run -itd --name qwq-32b-vllm \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /app/model/:/model/:ro \
-p 18034:8000 \
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64 \
vllm serve /model/QwQ-32B --max_model_len 32768 -tp 2 --served-model-name qwq32b
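After launching, it is worth confirming that the API server actually came up and that both NPUs are visible inside the container; a minimal check (exact log wording varies across vLLM versions):

# Follow the serving logs until the server reports it is listening on port 8000
docker logs -f qwq-32b-vllm

# npu-smi is mounted into the container above, so it can verify the two mapped NPUs
docker exec -it qwq-32b-vllm npu-smi info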

3. Test:

curl -H "Accept: application/json" \
-H "Content-type: application/json" \
-X POST \
-d '{"model": "qwq32b","messages": [{"role": "user", "content": "你是哪个模型?"},{"role": "assistant", "content": "你好"}],"stream": false}'  \
http://localhost:18034/v1/chat/completions
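It can also help to verify the registered model name first, since requests must use the --served-model-name value; the OpenAI-compatible API lists it at /v1/models:

# Should list a model with id "qwq32b"
curl -s http://localhost:18034/v1/models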

4. References:

https://agent.blog.csdn.net/article/details/155446713

https://agent.blog.csdn.net/article/details/155539176

https://quay.io/repository/ascend/vllm-ascend?tab=tags

https://github.com/vllm-project/vllm-ascend

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu.html
