vllm-ascend 2/2 - Dual-Node Inference
Serving the full-size DeepSeek model takes two machines (910B A2), which is far more painful than the single-node setup; it took half a day of joint debugging with the Huawei folks to get it working.
1. Image:
quay.io/ascend/vllm-ascend:v0.12.0rc1
quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2 (NJU mirror; v0.11.0rc2 is the one that ultimately worked, see section 5)
2. Weights: change "torch_dtype" in config.json under the weight directory to "float32"; in practice "bfloat16" also works. A one-liner for this edit is sketched after the download link below.
https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1/
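A minimal sketch of that edit, assuming the weights have been downloaded to /app1/models/DeepSeek-V3.1-w8a8 (the path used by the serve scripts below):

# Rewrite torch_dtype in the model's config.json
# (the weight path is an assumption taken from the serve scripts below)
sed -i 's/"torch_dtype": *"[a-z0-9]*"/"torch_dtype": "float32"/' \
    /app1/models/DeepSeek-V3.1-w8a8/config.json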
3. Startup: multi-node inference is more involved than single-node, but simpler than MindIE. First start the container on each node, then run the service startup script inside it; the master and worker service scripts differ only slightly.
- Container startup script start_docker.sh, identical on master and worker nodes. Launch command: start_docker.sh vllm-ds
#!/bin/bash

# Local image ID of the pulled vllm-ascend image (see docker images)
IMAGE_ID=f49277a2e0de

# Expect exactly one argument: the container name
if [ $# -ne 1 ]; then
    echo "Error: need exactly one argument for container name."
    exit 1
fi

# Container name (from argument)
CONTAINER_NAME="$1"

# Start the Docker container
docker run \
    --name "${CONTAINER_NAME}" \
    -it -d \
    --net=host \
    --shm-size=500g \
    --privileged=true \
    -w /home \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --entrypoint=bash \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /app1:/app1 \
    -v /tmp:/tmp \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
    -e http_proxy="$http_proxy" \
    -e https_proxy="$https_proxy" \
    "${IMAGE_ID}"
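Before moving on to the serve scripts, it is worth confirming that the container actually sees this node's NPUs. A quick check (container name as passed to start_docker.sh above):

# Enter the container started above
docker exec -it vllm-ds bash

# Inside the container: all eight NPUs of a 910B A2 node should be listed
npu-smi info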
- Master node service script node1.sh; master host IP: 10.178.231.234
#!/bin/sh

# nic_name is the network interface name corresponding to local_ip
# on the current node (obtained via ifconfig)
nic_name="bond0"
local_ip="10.178.231.234"

# AIV
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
    --host 0.0.0.0 \
    --port 8000 \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-address 10.178.231.234 \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 4 \
    --quantization ascend \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.94 \
    --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
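For orientation, the parallel layout works out as follows: tensor-parallel-size 4 × data-parallel-size 4 = 16 NPU ranks in total, i.e. 8 per node (data-parallel-size-local 2 × tensor-parallel-size 4), which matches two fully populated 910B A2 machines. The worker script below therefore runs headless, starts at data-parallel rank 2, and points --data-parallel-address back at the master.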
- Worker node service script node2.sh; worker host IP: 10.178.231.233
#!/bin/sh

# nic_name is the network interface name corresponding to local_ip
# on the current node (obtained via ifconfig)
nic_name="bond0"
local_ip="10.178.231.233"

# AIV
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
    --host 0.0.0.0 \
    --port 8000 \
    --headless \
    --data-parallel-size 4 \
    --data-parallel-size-local 2 \
    --data-parallel-start-rank 2 \
    --data-parallel-address 10.178.231.234 \
    --data-parallel-rpc-port 13389 \
    --tensor-parallel-size 4 \
    --quantization ascend \
    --seed 1024 \
    --served-model-name deepseek_v3 \
    --enable-expert-parallel \
    --max-num-seqs 16 \
    --max-model-len 16384 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code \
    --gpu-memory-utilization 0.94 \
    --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
    --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
    --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
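If HCCL initialization hangs or crashes at startup, the NPU-side RoCE links between the two hosts are the first thing to check. A minimal sketch using hccn_tool (mounted into the container by start_docker.sh; the peer address below is a placeholder for an ipaddr reported on the other node):

# Link status and RoCE IP of each NPU port (8 NPUs per node)
for i in {0..7}; do hccn_tool -i $i -link -g; done
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done

# Ping a peer NPU's RoCE IP from NPU 0
hccn_tool -i 0 -ping -g address <peer-npu-ip>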
4. Test:
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
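The other OpenAI-compatible endpoints work as well; two quick sanity checks (the model name comes from --served-model-name above):

# Should list "deepseek_v3"
curl http://localhost:8000/v1/models

# Chat-style request against the same server
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek_v3", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}'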
5. Major pitfall:
The scripts Huawei provided would not bring the service up no matter what we tried. Following the official vLLM docs the service did come up, but it crashed as soon as an inference request hit it. After a long investigation we were about to give up and hand it over to Huawei R&D, when they suggested trying an older image. With v0.11.0rc2 it worked. The final conclusion: a compatibility problem between the CANN version on the host and the image...
PS: host driver npu-smi 25.2.0, paired with the v0.11.0rc2 image.
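For checking the versions on both sides, the following may help (a sketch only; the exact file paths vary by driver/CANN installation):

# Host: the header line shows the npu-smi / driver version
npu-smi info

# Host: driver package version file (path may vary by installation)
cat /usr/local/Ascend/driver/version.info

# Inside the container: CANN toolkit version shipped with the image (path may vary)
cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info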
6. References:
https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.1.html