vllm-ascend 2/2 - Dual-Node Inference

  Serving the full DeepSeek model requires two machines (910B A2), which is considerably more involved than the single-node setup; it took a long joint debugging session with Huawei engineers to get it working.

1. Images:

quay.io/ascend/vllm-ascend:v0.12.0rc1
quay.io/ascend/vllm-ascend:v0.12.0rc1-openeuler
quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
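
To fetch an image (v0.11.0rc2 is the tag that ultimately worked here, see the pitfall in section 5; the quay.nju.edu.cn mirror is faster from mainland China):

docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.11.0rc2
docker images | grep vllm-ascend   # note the IMAGE ID, used as IMAGE_ID in start_docker.sh below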

2. Weights: in the weights directory, change "torch_dtype" in config.json to "float32"; in our tests "bfloat16" also works. A patch sketch follows the link below.

https://www.modelscope.cn/models/deepseek-ai/DeepSeek-V3.1/
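
A minimal, JSON-safe way to patch the field (a sketch; assumes the weights sit under /app1/models/DeepSeek-V3.1-w8a8, as in the launch scripts below):

python3 - <<'EOF'
import json, pathlib
p = pathlib.Path("/app1/models/DeepSeek-V3.1-w8a8/config.json")
cfg = json.loads(p.read_text())
cfg["torch_dtype"] = "float32"  # "bfloat16" also worked in our tests
p.write_text(json.dumps(cfg, indent=2, ensure_ascii=False))
EOF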

3. Startup: multi-node inference is more complex than single-node, but simpler than MindIE. First start the container on each node, then run the service launch script inside the container; the master and worker launch scripts differ only slightly.

  • Container launch script start_docker.sh, identical on both nodes. Launch command: start_docker.sh vllm-ds
#!/bin/bash

# Image ID of the vllm-ascend image (see section 1)
IMAGE_ID=f49277a2e0de

# Require exactly one argument (the container name)
if [ $# -ne 1 ]; then
    echo "Error: need exactly one argument for container name."
    exit 1
fi

# Container name (from the first argument)
CONTAINER_NAME="$1"

# Start the Docker container
docker run \
    --name "${CONTAINER_NAME}" \
    -it -d \
    --net=host \
    --shm-size=500g \
    --privileged=true \
    -w /home \
    --device=/dev/davinci_manager \
    --device=/dev/hisi_hdc \
    --device=/dev/devmm_svm \
    --entrypoint=bash \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /app1:/app1 \
    -v /tmp:/tmp \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /usr/share/zoneinfo/Asia/Shanghai:/etc/localtime:ro \
    -e http_proxy="$http_proxy" \
    -e https_proxy="$https_proxy" \
    "${IMAGE_ID}"
  • Master-node service script node1.sh; master host IP: 10.178.231.234
#!/bin/sh

# nic_name is the network interface (from ifconfig) that carries local_ip on this node
nic_name="bond0"
local_ip="10.178.231.234"

# Optional: expand HCCL ops on the AI Vector (AIV) cores
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip            # IP that HCCL binds to on this node
export GLOO_SOCKET_IFNAME=$nic_name    # NIC for Gloo (torch.distributed) traffic
export TP_SOCKET_IFNAME=$nic_name      # NIC for tensor-parallel traffic
export HCCL_SOCKET_IFNAME=$nic_name    # NIC for HCCL socket traffic
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1                   # use the vLLM V1 engine
export HCCL_BUFFSIZE=200               # HCCL buffer size, in MB
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True  # reduce NPU memory fragmentation
export VLLM_ASCEND_ENABLE_MLAPO=1      # fused MLA preprocessing operator
export HCCL_INTRA_PCIE_ENABLE=1        # intra-node traffic over PCIe/HCCS
export HCCL_INTRA_ROCE_ENABLE=0        # ...rather than RoCE

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.178.231.234 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 4 \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek_v3 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.94 \
  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
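
How the parallelism flags fit together: --data-parallel-size 4 × --tensor-parallel-size 4 = 16 ranks, exactly the 2 × 8 NPUs of the two machines; each node hosts --data-parallel-size-local 2 DP replicas of a TP-4 group. A sanity check before launching (a sketch; assumes each 910B A2 node exposes 8 NPUs):

# 16 total ranks must equal 2 nodes x 8 NPUs; 8 local ranks must fit one node.
dp=4; tp=4; dp_local=2; nodes=2; npus_per_node=8
[ $((dp * tp)) -eq $((nodes * npus_per_node)) ] || echo "total rank mismatch"
[ $((dp_local * tp)) -eq $npus_per_node ] || echo "local rank mismatch"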
  • Worker-node service script node2.sh (same environment block as the master); worker host IP: 10.178.231.233
#!/bin/sh

# nic_name is the network interface (from ifconfig) that carries local_ip on this node
nic_name="bond0"
local_ip="10.178.231.233"

# AIV
# export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export VLLM_USE_V1=1
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_ASCEND_ENABLE_MLAPO=1
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0

vllm serve /app1/models/DeepSeek-V3.1-w8a8 \
  --host 0.0.0.0 \
  --port 8000 \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.178.231.234 \
  --data-parallel-rpc-port 13389 \
  --tensor-parallel-size 4 \
  --quantization ascend \
  --seed 1024 \
  --served-model-name deepseek_v3 \
  --enable-expert-parallel \
  --max-num-seqs 16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 4096 \
  --trust-remote-code \
  --gpu-memory-utilization 0.94 \
  --speculative-config '{"num_speculative_tokens":1,"method":"mtp"}' \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
  --additional-config '{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
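
Everything else matches the master script; the worker-specific pieces boil down to three lines, summarized here for quick reference:

# Worker (10.178.231.233) vs. master (10.178.231.234):
#   local_ip="10.178.231.233"        # this node's own IP, not the master's
#   --headless                       # no API server here; engine ranks only
#   --data-parallel-start-rank 2     # master owns DP ranks 0-1, worker owns 2-3
# --data-parallel-address points at the master on both nodes.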

4. Test:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek_v3",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0
    }'
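
Before firing the completion request, vLLM's built-in probes are handy for checking that the whole cluster came up (both endpoints are standard in vLLM's OpenAI-compatible server):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8000/health   # 200 once the engine is ready
curl -s http://localhost:8000/v1/models                                 # should list "deepseek_v3"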

5. Major pitfall:

  No matter what we tried, the scripts Huawei provided could not bring the service up. Following the official vLLM docs (see section 6), the service did start, but it crashed as soon as inference began. After a long investigation we were about to give up and escalate to Huawei R&D, when they suggested trying an older image. With v0.11.0rc2 it worked. The final conclusion: a compatibility problem between the CANN version on the host and the one in the image...

  P.S. Host driver npu-smi 25.2.0, paired with the v0.11.0rc2 image.
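
To see what you are actually running: the driver version appears in the npu-smi banner on the host, and the CANN version inside the container is recorded in the toolkit install info (the path below is the usual CANN layout and may differ per install):

npu-smi info | head -n 3     # host driver version banner
docker exec vllm-ds bash -c \
  "cat /usr/local/Ascend/ascend-toolkit/latest/*/ascend_toolkit_install.info"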

6. References:

https://docs.vllm.ai/projects/ascend/en/latest/tutorials/DeepSeek-V3.1.html
