vllm-ascend
I had been using the MindIE framework for large-model inference; vLLM can now export monitoring metrics to Prometheus/Grafana, and it supports Ascend as well.
1. Image: getting the image was genuinely hard; I finally pulled it from the 渡渡鸟 mirror: swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64
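For reference, pulling that mirrored tag is a single command:

docker pull swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64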
2. Launch: adapted from the official command. The weights directory is mounted at /model/QwQ-32B, running tensor-parallel across 2 NPUs.
docker run -itd --name qwq-32b-vllm \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /app/model/:/model/:ro \
-p 18034:8000 \
swr.cn-north-4.myhuaweicloud.com/ddn-k8s/quay.io/ascend/vllm-ascend:v0.9.2rc1-linuxarm64 \
vllm serve /model/QwQ-32B --max-model-len 32768 -tp 2 --served-model-name qwq32b
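Before testing, it is worth confirming the server has finished loading the weights. A quick sanity check, using the container name and port mapping from the command above:

docker logs -f qwq-32b-vllm   # watch until the OpenAI-compatible API server reports it is ready
curl http://localhost:18034/v1/models   # should list the served model name "qwq32b"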
3. Test:
curl -H "Accept: application/json" \
  -H "Content-type: application/json" \
  -X POST \
  -d '{"model": "qwq32b","messages": [{"role": "user", "content": "你是哪个模型?"},{"role": "assistant", "content": "你好"}],"stream": false}' \
  http://localhost:18034/v1/chat/completions
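Since the Prometheus/Grafana metrics were the reason for switching, also verify that vLLM is exporting them; they are served at /metrics on the same port:

curl -s http://localhost:18034/metrics | grep "vllm:"

A minimal prometheus.yml scrape job for this endpoint (the job name, interval, and target address below are illustrative, not from the original setup):

scrape_configs:
  - job_name: "vllm-qwq32b"
    scrape_interval: 15s
    static_configs:
      - targets: ["<host-ip>:18034"]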
4. References:
https://agent.blog.csdn.net/article/details/155446713
https://agent.blog.csdn.net/article/details/155539176
https://quay.io/repository/ascend/vllm-ascend?tab=tags
https://github.com/vllm-project/vllm-ascend
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/multi_npu.html