vLLM: a super-powerful open-source LLM inference engine


Read the original

For the latest version of this document and the best reading experience, read the original: "vLLM: a super-powerful open-source LLM inference engine"

https://docs.dingtalk.com/i/nodes/wva2dxOW4Y6lZm6oT4v0xrqNVbkz3BRL?corpId=

Introduction

vLLM (56.5k GitHub stars, far ahead of other open-source inference engines) is a high-performance open-source inference engine purpose-built for large language models, offering high throughput, low latency, and efficient memory management. Its distributed inference performance is also strong.

DeepSpeed (nearly 40k GitHub stars) and SGLang (around 18k GitHub stars) are other excellent open-source LLM inference engines with similar functionality.

Core features

High-performance inference and serving

  • State-of-the-art serving throughput, well suited to handling concurrent requests.

  • Continuous batching dynamically aggregates incoming requests, raising GPU utilization while keeping latency low.

Efficient memory management: PagedAttention

  • PagedAttention splits the KV cache into fixed-size blocks (analogous to OS memory paging), reducing memory fragmentation and redundant copies and significantly improving memory efficiency and throughput.
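To make the paging idea concrete, here is a rough, stdlib-only sizing sketch. The model dimensions (32 layers, 32 KV heads, head size 128, fp16) and the 16-token block size are illustrative assumptions, not vLLM defaults:

```python
# Back-of-the-envelope KV-cache sizing under block-based paging.
# All model dimensions below are illustrative assumptions.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
BYTES_PER_EL = 2          # fp16/bf16
BLOCK_TOKENS = 16         # tokens per KV-cache block (assumed)

# Bytes of KV cache needed per token: K and V tensors, for every layer.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_EL

def blocks_needed(seq_len: int) -> int:
    # Paged allocation rounds up to whole blocks, so per-sequence waste
    # is bounded by one partially filled block instead of a huge
    # contiguous over-allocation for the worst-case length.
    return -(-seq_len // BLOCK_TOKENS)  # ceiling division

print(kv_bytes_per_token)   # 524288 bytes, i.e. 0.5 MiB per token
print(blocks_needed(100))   # 7 blocks for a 100-token sequence
```

The point of the rounding rule is that memory waste stays bounded by one block per sequence, which is where the fragmentation savings come from.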

Quantization support

  • Multiple quantization formats (GPTQ, AWQ, INT4, INT8, FP8, and more), significantly reducing compute and memory overhead during inference.

Optimized execution and decoding

  • CUDA/HIP graphs and optimized CUDA kernels (integrating FlashAttention and FlashInfer) accelerate model inference.

  • Speculative decoding and chunked prefill speed up token generation and response time.

Broad compatibility and deployment support

  • Seamless integration with the Hugging Face model hub (Llama, Mixture-of-Experts, embedding, and multi-modal models, among others), making model migration and use convenient.

  • Multiple parallelism strategies: tensor, pipeline, data, and expert parallelism, enabling large-scale distributed deployment.

  • Streaming outputs for real-time interactive use cases.

  • An OpenAI-compatible API server that works as a drop-in replacement for the OpenAI API, easing integration with existing systems.

  • Support for many hardware platforms: NVIDIA GPUs, AMD GPUs/CPUs, Intel CPUs, Gaudi accelerators, IBM Power CPUs, TPUs, AWS Trainium/Inferentia, and more.

  • Prefix caching and multi-LoRA, covering fine-tuning and efficient-reuse scenarios.

Tool calling

  • Since vLLM ≥ 0.8.3, chat requests support named function calling and the auto / required / none tool_choice modes, enabling LLM tool integration.
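A tool-calling request to the OpenAI-compatible server is just a chat-completions body with a tools list. The sketch below builds one with the standard library only; the get_weather tool is a made-up example, and the schema follows the OpenAI-style convention the server accepts:

```python
import json

# A chat-completions request body with one hypothetical tool defined.
request = {
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, not a real API
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # or "required", "none", or a named function
}

body = json.dumps(request)
assert json.loads(body)["tool_choice"] == "auto"  # round-trips cleanly
```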

Core advantages

Significantly better performance and resource utilization

Compared with plain HuggingFace Transformers, PagedAttention can deliver a 2-4x or greater throughput improvement, with especially strong gains on long contexts.

One Medium write-up reports that the combined mechanisms achieve "up to 30x faster serving", reclaim 60-80% of wasted memory, and generate tokens 2.8x faster.

Ramp notes that vLLM improves concurrent-request handling while reducing GPU usage and overall serving cost, making it a clearly cost-effective inference solution.

Strong development and ecosystem compatibility

Compatibility with Hugging Face and the OpenAI API lowers migration cost, so existing systems can deploy it quickly.

Multi-hardware, multi-scenario deployment

Support for mainstream accelerators and GPU platforms covers a wide range of deployment scenarios and reduces hardware lock-in.

Official site

vLLM - vLLM

GitHub repository

vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs

Installing vLLM with uv (officially recommended)

(Optional) Install the CUDA Toolkit (with nvcc):

You can skip installing nvidia-cuda-toolkit by hand, since vLLM ships with suitable CUDA libraries of its own.

#sudo apt update
#sudo apt install -y nvidia-cuda-toolkit

Install a C/C++ compiler

sudo apt install -y build-essential

Install vLLM via uv pip

Official docs: GPU - vLLM

mkdir vllm
cd vllm
uv venv --python 3.12 --seed  # Python 3.12 is required: with a newer version, running models currently errors out; vLLM will support newer Python later
source .venv/bin/activate

uv pip install vllm --torch-backend=auto


Running a model (single node)

export HF_ENDPOINT=https://hf-mirror.com
# This model is tiny, which makes it convenient for testing
MODEL_NAME="facebook/opt-125m"
PORT=8000

uv run --with vllm \
    python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_NAME \
    --port $PORT \
    --host 0.0.0.0 \
    --trust-remote-code


Verify

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "facebook/opt-125m",
        "prompt": "Hello, how are you today?",
        "max_tokens": 50
      }' | jq .


Common API calls

List the models served by vLLM

curl http://localhost:8000/v1/models

Notes on distributed inference

Network bandwidth matters. vLLM does not strictly require an InfiniBand network, and an ordinary gigabit LAN does work, but the more bandwidth you have, the better the performance. In my experiments the inter-node communication overhead was substantial and fully saturated the link (I was on a gigabit LAN). 10 GbE or even 25 GbE is much better, and InfiniBand is ideal if you can get it.
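A rough estimate shows why gigabit saturates so easily. The per-step payload size below is an illustrative assumption, and protocol overhead is ignored; only the link-speed arithmetic is the point:

```python
# Rough wire-time estimate for inter-node traffic during distributed
# inference. The payload size is an illustrative assumption.
def transfer_ms(payload_bytes: int, link_gbps: float) -> float:
    # Convert link speed from gigabits/s to bytes/s; ignore overhead.
    return payload_bytes / (link_gbps * 1e9 / 8) * 1000

payload = 100 * 1024 * 1024   # assume ~100 MiB exchanged per step
print(round(transfer_ms(payload, 1.0), 1))    # ~838.9 ms on 1 GbE
print(round(transfer_ms(payload, 10.0), 1))   # ~83.9 ms on 10 GbE
```

An order of magnitude in link speed is an order of magnitude off the communication time, which is why upgrading from gigabit pays off so directly here.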

Distributed inference (with Docker)

Reference: Multi-Node-Serving - vLLM

The vLLM project provides a shell script that makes distributed inference straightforward. The script essentially runs vLLM inside Docker containers, so Docker and the NVIDIA Container Toolkit must be installed first.

Installing Docker

Note: as of this writing, Docker Desktop's GPU support only covers WSL on Windows; plain Linux is not supported, so make sure to install Docker Engine rather than Docker Desktop. https://docs.docker.com/desktop/features/gpu/#validate-gpu-support

For Docker installation instructions, see "Installing Docker".

Installing and configuring the NVIDIA Container Toolkit

Official docs: Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

Configuring the NVIDIA Container Toolkit


sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
sudo systemctl restart docker
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place

Verify

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

If the output is an error, see the "Troubleshooting" section of this article.


Running the script on the head node

The project ships a ready-to-use shell script, downloadable from GitHub: Run Cluster - vLLM (vllm/examples/online_serving/run_cluster.sh at main · vllm-project/vllm)

After downloading it, upload it to the node's home directory and remember to make it executable: chmod +x ./run_cluster.sh

The command below ultimately fails during inference; the working command appears further down, together with an account of how I resolved the failures.

# 10.65.37.234 is the head node's IP address. VLLM_HOST_IP must be the IP of the host running the script, so its value differs from node to node.
bash run_cluster.sh \
                vllm/vllm-openai \
                10.65.37.234 \
                --head \
                /home/ubuntu/.cache/huggingface \
                -e VLLM_HOST_IP=10.65.37.234

The following command succeeded:

bash run_cluster.sh \
                vllm/vllm-openai \
                10.65.37.234 \
                --head \
                /home/ubuntu/.cache/huggingface \
                -e VLLM_HOST_IP=10.65.37.234 \
                -e GLOO_SOCKET_IFNAME=enp0s31f6 \
                -e VLLM_LOGGING_LEVEL=DEBUG \
                -e HF_ENDPOINT=https://hf-mirror.com \
                -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
                -e VLLM_ATTENTION_BACKEND=FLASHINFER \
                -e VLLM_DISABLE_VIDEO_PROCESSING=1

The container image is very large.


Running the script on a worker node

The command below ultimately fails during inference; the working command appears further down.


bash run_cluster.sh \
                vllm/vllm-openai \
                10.65.37.234 \
                --worker \
                /home/ubuntu/.cache/huggingface \
                -e VLLM_HOST_IP=10.65.37.233 \
                -e HF_ENDPOINT=https://hf-mirror.com \
                -e VLLM_LOGGING_LEVEL=DEBUG

The following command succeeded:

bash run_cluster.sh \
                vllm/vllm-openai \
                10.65.37.234 \
                --worker \
                /home/ubuntu/.cache/huggingface \
                -e VLLM_HOST_IP=10.65.37.233 \
                -e GLOO_SOCKET_IFNAME=enp0s31f6 \
                -e VLLM_LOGGING_LEVEL=DEBUG \
                -e HF_ENDPOINT=https://hf-mirror.com \
                -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
                -e VLLM_ATTENTION_BACKEND=FLASHINFER \
                -e VLLM_DISABLE_VIDEO_PROCESSING=1

Checking the Ray cluster status

Enter the Docker container on any node

# Open a new terminal; `docker ps` lists the currently running containers
docker ps -a

The vllm-openai containers are the ones whose names start with node-


Run commands inside the container

docker exec -it node-17012 /bin/bash

ray status
ray list nodes  # it is normal for this command to fail: the official container does not enable the Ray dashboard by default

The output shows two nodes in total, with 48 CPU cores, two GPUs, and roughly 105 GB of RAM.


Running vllm serve to start inference

With the Ray cluster up, the inference command can be run on any node, much as in the single-node case.

# --tensor-parallel-size must equal the number of GPUs per node; --pipeline-parallel-size is the total number of nodes in the Ray cluster.
# The long snapshot path below is where the model is stored; if you don't know it, the hf cache scan command will show it.
vllm serve  ~/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e \
     --tensor-parallel-size 1 \
     --pipeline-parallel-size 2
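The sizing rule in the comment above (tensor parallelism within a node, pipeline parallelism across nodes) can be captured in a tiny helper; the function and its name are illustrative, not part of vLLM:

```python
# Illustrative helper for the sizing rule described above:
# tensor parallelism spans the GPUs inside one node,
# pipeline parallelism spans the nodes of the Ray cluster.
def parallel_sizes(nodes: int, gpus_per_node: int) -> tuple[int, int]:
    tp = gpus_per_node   # --tensor-parallel-size
    pp = nodes           # --pipeline-parallel-size
    # The world size (tp * pp) must equal the total GPU count.
    assert tp * pp == nodes * gpus_per_node
    return tp, pp

# Two workstations with one GPU each, as in this article's setup:
print(parallel_sizes(2, 1))   # (1, 2)
```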

However...

It failed with the error below, which is exactly the same error I got when attempting distributed inference without Docker.

RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

I added one environment variable (GLOO_SOCKET_IFNAME, which lets you explicitly pick the network interface used for inter-node communication). That error disappeared, but an OOM error appeared instead:

# Use the ip a command to find the interface name
bash run_cluster.sh \
                vllm/vllm-openai \
                10.65.37.234 \
                --head \
                /home/ubuntu/.cache/huggingface \
                -e VLLM_HOST_IP=10.65.37.234 \
                -e GLOO_SOCKET_IFNAME=enp0s31f6 \
                -e VLLM_LOGGING_LEVEL=DEBUG

The error message:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 960.00 MiB. GPU 0 has a total capacity of 11.62 GiB of which 62.69 MiB is free. Including non-PyTorch memory, this process has 11.55 GiB memory in use. Of the allocated memory 11.25 GiB is allocated by PyTorch, and 132.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(RayWorkerWrapper pid=544) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(RayWorkerWrapper pid=544) You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.

After adjusting the command once more, vLLM started without errors.

Because UI-TARS is a vision-language model, it needs more GPU memory than a text-only model of comparable size.

# --pipeline-parallel-size is omitted below, which is allowed; the official docs explain: https://docs.vllm.ai/en/stable/serving/parallelism_scaling.html#running-vllm-on-a-ray-cluster
vllm serve  ~/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e \
     --tensor-parallel-size 2 \
     --dtype bfloat16 \
     --max-model-len 8192 \
     --max-num-seqs 2 \
     --trust-remote-code

# Note that the command above points at the model's local path. That is uncommon; the usual approach is to give the Hugging Face model name, as shown below.
# I also found that with a local path, calling the service via curl works fine, but agent-tars errors out.
# Verified: --max-model-len can also be set to 40000 (my setup: two workstations, each with a single 12 GB GPU).
vllm serve ByteDance-Seed/UI-TARS-1.5-7B \
     --tensor-parallel-size 2 \
     --dtype bfloat16 \
     --max-model-len 8192 \
     --max-num-seqs 2 \
     --trust-remote-code


vllm serve ByteDance-Seed/UI-TARS-1.5-7B \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max-model-len 40000 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --trust-remote-code \
  --api-key your_api_key
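The earlier OOM makes sense from a quick weight-memory estimate. The parameter count below is an assumption ("7B" vision-language models are usually well over 7e9 parameters once the vision tower and embeddings are counted); only the arithmetic is the point:

```python
# Rough per-GPU weight-memory estimate for bf16 under tensor parallelism.
# The parameter count is an assumption, not an official figure.
def weight_gib_per_gpu(params: float, bytes_per_param: int, tp: int) -> float:
    return params * bytes_per_param / tp / 2**30

est = weight_gib_per_gpu(8.4e9, 2, 2)   # bf16 weights, --tensor-parallel-size 2
print(round(est, 2))                     # ~7.82 GiB per GPU
```

That is in the same ballpark as the "Model loading took 7.8689 GiB" lines in the log below; on a 12 GiB card it leaves little headroom for the KV cache, which is why tightening --max-model-len and --max-num-seqs was necessary.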

Full output:

INFO 08-18 01:30:49 [__init__.py:235] Automatically detected platform cuda.
INFO 08-18 01:30:50 [api_server.py:1755] vLLM API server version 0.10.1.dev1+gbcc0a3cbe
INFO 08-18 01:30:50 [cli_args.py:261] non-default args: {'model_tag': '/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e', 'model': '/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 8192, 'tensor_parallel_size': 2, 'max_num_seqs': 2}
INFO 08-18 01:30:53 [config.py:1604] Using max model len 8192
INFO 08-18 01:30:53 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-18 01:30:55 [__init__.py:235] Automatically detected platform cuda.
INFO 08-18 01:30:56 [core.py:572] Waiting for init message from front-end.
INFO 08-18 01:30:56 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev1+gbcc0a3cbe) with config: model='/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":4,"local_cache_dir":null}
2025-08-18 01:30:56,842 INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.65.37.234:6379...
2025-08-18 01:30:56,849 INFO worker.py:1927 -- Connected to Ray cluster.
INFO 08-18 01:30:56 [ray_utils.py:336] No current placement group found. Creating a new placement group.
WARNING 08-18 01:30:56 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 9a2a931b37c58f6270d1330b25db8cd4579495063788c715cd4641d2. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
WARNING 08-18 01:30:56 [ray_utils.py:200] tensor_parallel_size=2 is bigger than a reserved number of GPUs (1 GPUs) in a node 51ca02f8a0c39dbd017cdc6d95562729994b17689e7db975666c1011. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 2 GPUs available at each node.
INFO 08-18 01:30:56 [ray_distributed_executor.py:169] use_ray_spmd_worker: True
(pid=551) INFO 08-18 01:30:58 [__init__.py:235] Automatically detected platform cuda.
INFO 08-18 01:30:59 [ray_env.py:63] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
INFO 08-18 01:30:59 [ray_env.py:65] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_WORKER_MULTIPROC_METHOD', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_V1', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_USAGE_SOURCE']
INFO 08-18 01:30:59 [ray_env.py:68] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [__init__.py:1375] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [pynccl.py:70] vLLM is using nccl==2.26.2
(RayWorkerWrapper pid=551) WARNING 08-18 01:31:01 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[], buffer_handle=None, local_subscribe_addr=None, remote_subscribe_addr='tcp://10.65.37.234:35057', remote_addr_ipv6=False)
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [parallel_state.py:1102] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(RayWorkerWrapper pid=1062, ip=10.65.37.233) WARNING 08-18 01:31:01 [profiling.py:276] The sequence length (8192) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:01 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:01 [gpu_model_runner.py:1843] Starting to load model /root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e...
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [gpu_model_runner.py:1875] Loading model from scratch...
(RayWorkerWrapper pid=551) WARNING 08-18 01:31:01 [vision.py:91] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [cuda.py:290] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  14% Completed | 1/7 [00:00<00:02,  2.92it/s]
Loading safetensors checkpoint shards:  29% Completed | 2/7 [00:00<00:02,  2.24it/s]
Loading safetensors checkpoint shards:  43% Completed | 3/7 [00:01<00:01,  2.31it/s]
Loading safetensors checkpoint shards:  57% Completed | 4/7 [00:01<00:01,  2.16it/s]
Loading safetensors checkpoint shards:  71% Completed | 5/7 [00:02<00:00,  2.08it/s]
Loading safetensors checkpoint shards:  86% Completed | 6/7 [00:02<00:00,  2.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00,  2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 7/7 [00:03<00:00,  2.11it/s]
(RayWorkerWrapper pid=551)
(RayWorkerWrapper pid=551) INFO 08-18 01:31:04 [default_loader.py:262] Loading weights took 3.34 seconds
(pid=1062, ip=10.65.37.233) INFO 08-18 01:30:58 [__init__.py:235] Automatically detected platform cuda.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:05 [gpu_model_runner.py:1892] Model loading took 7.8689 GiB and 3.426220 seconds
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:00 [__init__.py:1375] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:00 [pynccl.py:70] vLLM is using nccl==2.26.2
(RayWorkerWrapper pid=1062, ip=10.65.37.233) WARNING 08-18 01:31:01 [custom_all_reduce.py:85] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:01 [parallel_state.py:1102] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(RayWorkerWrapper pid=551) WARNING 08-18 01:31:01 [profiling.py:276] The sequence length (8192) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:01 [gpu_model_runner.py:1843] Starting to load model /root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e...
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:01 [gpu_model_runner.py:1875] Loading model from scratch...
(RayWorkerWrapper pid=1062, ip=10.65.37.233) WARNING 08-18 01:31:01 [vision.py:91] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:01 [cuda.py:290] Using Flash Attention backend on V1 engine.
(RayWorkerWrapper pid=551) INFO 08-18 01:31:06 [gpu_model_runner.py:2380] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(RayWorkerWrapper pid=551) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(RayWorkerWrapper pid=551) You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:33:59 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c0320d5708/rank_1_0/backbone for vLLM's torch.compile
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:33:59 [backends.py:541] Dynamo bytecode transform time: 2.73 s
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:06 [default_loader.py:262] Loading weights took 4.98 seconds
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:06 [gpu_model_runner.py:1892] Model loading took 7.8689 GiB and 5.065658 seconds
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:31:06 [gpu_model_runner.py:2380] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:34:00 [backends.py:194] Cache the graph for dynamic shape for later use
(RayWorkerWrapper pid=1062, ip=10.65.37.233) [rank1]:W0818 01:34:00.831000 1062 torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
(RayWorkerWrapper pid=1062, ip=10.65.37.233) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) You have video processor config saved in `preprocessor.json` file which is deprecated. Video processor configs should be saved in their own `video_preprocessor.json` file. You can rename the file or load and save the processor back which renames it automatically. Loading from `preprocessor.json` will be removed in v5.0.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:34:10 [backends.py:215] Compiling a graph for dynamic shape takes 11.68 s
(RayWorkerWrapper pid=551) INFO 08-18 01:33:59 [backends.py:530] Using cache directory: /root/.cache/vllm/torch_compile_cache/c0320d5708/rank_0_0/backbone for vLLM's torch.compile
(RayWorkerWrapper pid=551) INFO 08-18 01:33:59 [backends.py:541] Dynamo bytecode transform time: 2.76 s
(RayWorkerWrapper pid=551) INFO 08-18 01:34:00 [backends.py:194] Cache the graph for dynamic shape for later use
(RayWorkerWrapper pid=551) INFO 08-18 01:34:20 [monitor.py:34] torch.compile takes 14.69 s in total
(RayWorkerWrapper pid=551) INFO 08-18 01:34:11 [backends.py:215] Compiling a graph for dynamic shape takes 11.94 s
(RayWorkerWrapper pid=1062, ip=10.65.37.233) /usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(RayWorkerWrapper pid=1062, ip=10.65.37.233) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
(RayWorkerWrapper pid=1062, ip=10.65.37.233)   warnings.warn(
(RayWorkerWrapper pid=551) [rank0]:W0818 01:34:00.909000 551 torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
(RayWorkerWrapper pid=551) INFO 08-18 01:34:21 [gpu_worker.py:255] Available KV cache memory: 0.69 GiB
INFO 08-18 01:34:21 [kv_cache_utils.py:833] GPU KV cache size: 25,968 tokens
INFO 08-18 01:34:21 [kv_cache_utils.py:837] Maximum concurrency for 8,192 tokens per request: 3.17x
INFO 08-18 01:34:21 [kv_cache_utils.py:833] GPU KV cache size: 26,256 tokens
INFO 08-18 01:34:21 [kv_cache_utils.py:837] Maximum concurrency for 8,192 tokens per request: 3.21x
Capturing CUDA graph shapes:   0%|          | 0/3 [00:00<?, ?it/s]
Capturing CUDA graph shapes:  33%|███▎      | 1/3 [00:00<00:00,  8.03it/s]
(RayWorkerWrapper pid=1062, ip=10.65.37.233) INFO 08-18 01:34:22 [gpu_model_runner.py:2485] Graph capturing finished in 0 secs, took 0.08 GiB
INFO 08-18 01:34:22 [core.py:193] init engine (profile, create kv cache, warmup model) took 195.65 seconds
Capturing CUDA graph shapes: 100%|██████████| 3/3 [00:00<00:00, 11.34it/s]
WARNING 08-18 01:34:22 [profiling.py:276] The sequence length (8192) is smaller than the pre-defined worst-case total number of multimodal tokens (32768). This may cause certain multi-modal inputs to fail during inference. To avoid this, you should increase `max_model_len` or reduce `mm_counts`.
INFO 08-18 01:34:23 [loggers.py:141] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1623
INFO 08-18 01:34:23 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO 08-18 01:34:23 [launcher.py:29] Available routes are:
INFO 08-18 01:34:23 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
INFO 08-18 01:34:23 [launcher.py:37] Route: /docs, Methods: GET, HEAD
INFO 08-18 01:34:23 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 08-18 01:34:23 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
INFO 08-18 01:34:23 [launcher.py:37] Route: /health, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /load, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /ping, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /ping, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /tokenize, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /detokenize, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/models, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /version, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/responses, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/completions, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/embeddings, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /pooling, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /classify, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /score, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/score, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /rerank, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v1/rerank, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /v2/rerank, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /invocations, Methods: POST
INFO 08-18 01:34:23 [launcher.py:37] Route: /metrics, Methods: GET
INFO:     Started server process [3283]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
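The KV-cache lines in that log are easy to sanity-check: the reported "Maximum concurrency" is simply the GPU KV cache size in tokens divided by --max-model-len.

```python
# Reproducing the "Maximum concurrency" figures from the log above.
def max_concurrency(cache_tokens: int, max_model_len: int) -> float:
    return round(cache_tokens / max_model_len, 2)

print(max_concurrency(25968, 8192))   # 3.17, matching the log
print(max_concurrency(26256, 8192))   # 3.21, matching the log
```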

Verify:

# Note: curl http://10.65.37.234:8000/v1/models | jq . shows the served model name
curl http://10.65.37.234:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e",
        "prompt": "Hello, how are you today?"
      }' | jq .

Output:

{
  "id": "cmpl-e0a56cf1fb804681b32bca2c57a353ea",
  "object": "text_completion",
  "created": 1755507536,
  "model": "/root/.cache/huggingface/hub/models--ByteDance-Seed--UI-TARS-1.5-7B/snapshots/683d002dd99d8f95104d31e70391a39348857f4e",
  "choices": [
    {
      "index": 0,
      "text": " I'm doing well, thank you for asking! How can I assist you with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 23,
    "completion_tokens": 16,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
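Extracting the useful fields from that response takes only the standard library; the literal below is abbreviated from the JSON shown above:

```python
import json

# An abbreviated copy of the /v1/completions response shown above.
response = json.loads("""
{
  "object": "text_completion",
  "choices": [
    {"index": 0,
     "text": " I'm doing well, thank you for asking! How can I assist you with",
     "finish_reason": "length"}
  ],
  "usage": {"prompt_tokens": 7, "total_tokens": 23, "completion_tokens": 16}
}
""")

text = response["choices"][0]["text"]
# finish_reason == "length" means generation stopped at the max_tokens
# limit rather than at a natural stopping point.
print(response["choices"][0]["finish_reason"])   # length
assert response["usage"]["total_tokens"] == \
       response["usage"]["prompt_tokens"] + response["usage"]["completion_tokens"]
```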


Verify again

curl -X POST "http://10.65.37.234:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ByteDance-Seed/UI-TARS-1.5-7B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }' | jq .
{
  "id": "chatcmpl-24fb0ae378444765998590c9ec17b1e3",
  "object": "chat.completion",
  "created": 1755515237,
  "model": "ByteDance-Seed/UI-TARS-1.5-7B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? I'm here to answer any questions you have about a variety of topics. Whether you need help with a specific task, are looking for information on a particular subject, or just want to have a conversation, please let me know. I'll do my best to provide you with useful resources, explanations, or support.",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 20,
    "total_tokens": 93,
    "completion_tokens": 73,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

Running the CognitiveKernel/Qwen3-8B-CK-Pro model

vllm serve CognitiveKernel/Qwen3-8B-CK-Pro \
     --tensor-parallel-size 2 \
     --max-model-len 8192 \
     --max-num-seqs 2 \
     --trust-remote-code

Distributed inference (without Docker)

The project recommends Docker for distributed inference, since it sidesteps messy environment issues, but it also provides a shell script showing how to run distributed inference directly across multiple hosts: vllm/examples/online_serving/multi-node-serving.sh at main · vllm-project/vllm

Official notes: Multi-Node-Serving - vLLM

However, whenever I followed that script, I hit this error:

RuntimeError: Gloo connectFullMesh failed with [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error

An AI assistant diagnosed it as a network communication failure, in vLLM rather than in the Ray cluster. I later hit exactly the same error when running distributed inference with Docker (see "vLLM: a super-powerful open-source LLM inference engine" above), where setting GLOO_SOCKET_IFNAME=enp0s31f6 (substitute your own interface name) resolved it. The same fix should therefore apply here too, though I have not verified it yet.

Distributed inference (via a k8s cluster)

For details, see: "Automated k8s deployment, automated deployment of monitoring, alerting, and logging components, and a detailed illustrated guide to serving LLMs on k8s with vLLM"

Installing vLLM with conda (old method, not recommended)

Older vLLM docs recommended installing with conda, but this later changed: the official recommendation is now uv rather than conda. The content below is what I wrote previously, kept only for the record.

Installing conda

Reference:

Installing Miniconda - Anaconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh

Creating a conda virtual environment and installing vLLM

Reference: Quickstart — vLLM

conda create -n myenv python=3.12 -y
conda activate myenv
pip install vllm

Running a model on a single node with vLLM (old docs)

Set environment variables

export HF_ENDPOINT=https://hf-mirror.com

Run the specified model with vllm serve

vllm serve Qwen/Qwen2.5-1.5B-Instruct

The output is as follows:

(myenv) root@iv-ydrr1odszkwh2yo6rxhy:~#  vllm serve Qwen/Qwen2.5-1.5B-Instruct
INFO 04-24 14:37:03 [__init__.py:239] Automatically detected platform cuda.
INFO 04-24 14:37:04 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-24 14:37:04 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-1.5B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-1.5B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, 
rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7efb84154900>)
config.json: 660B [00:00, 6.11MB/s]
INFO 04-24 14:37:11 [config.py:689] This model supports multiple tasks: {'generate', 'classify', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
INFO 04-24 14:37:11 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
tokenizer_config.json: 7.30kB [00:00, 48.4MB/s]
vocab.json: 2.78MB [00:00, 3.17MB/s]
merges.txt: 1.67MB [00:00, 6.09MB/s]
tokenizer.json: 7.03MB [00:00, 8.92MB/s]
generation_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 242/242 [00:00<00:00, 3.55MB/s]
INFO 04-24 14:37:21 [__init__.py:239] Automatically detected platform cuda.
INFO 04-24 14:37:22 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-1.5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-24 14:37:23 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f477d2e7e90>
[W424 14:37:24.343535930 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W424 14:37:24.349196637 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 04-24 14:37:24 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-24 14:37:24 [cuda.py:221] Using Flash Attention backend on V1 engine.
INFO 04-24 14:37:24 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-1.5B-Instruct...
WARNING 04-24 14:37:24 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 04-24 14:37:25 [weight_utils.py:265] Using model weights format ['*.safetensors']
model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.09G/3.09G [10:03<00:00, 5.12MB/s]
INFO 04-24 14:47:29 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-1.5B-Instruct: 604.447165 seconds
INFO 04-24 14:47:30 [weight_utils.py:315] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.22it/s]

INFO 04-24 14:47:31 [loader.py:458] Loading weights took 0.50 seconds
INFO 04-24 14:47:31 [gpu_model_runner.py:1291] Model loading took 2.8871 GiB and 607.113554 seconds
INFO 04-24 14:47:37 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/e5f5b1e065/rank_0_0 for vLLM's torch.compile
INFO 04-24 14:47:37 [backends.py:426] Dynamo bytecode transform time: 5.43 s
INFO 04-24 14:47:39 [backends.py:132] Cache the graph of shape None for later use
INFO 04-24 14:47:56 [backends.py:144] Compiling a graph for general shape takes 18.81 s
INFO 04-24 14:48:05 [monitor.py:33] torch.compile takes 24.24 s in total
INFO 04-24 14:48:06 [kv_cache_utils.py:634] GPU KV cache size: 3,026,608 tokens
INFO 04-24 14:48:06 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 92.36x
INFO 04-24 14:48:20 [gpu_model_runner.py:1626] Graph capturing finished in 15 secs, took 0.56 GiB
INFO 04-24 14:48:20 [core.py:163] init engine (profile, create kv cache, warmup model) took 48.97 seconds
INFO 04-24 14:48:20 [core_client.py:435] Core engine process 0 ready.
WARNING 04-24 14:48:21 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-24 14:48:21 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-24 14:48:21 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.1, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-24 14:48:21 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-24 14:48:21 [launcher.py:26] Available routes are:
INFO 04-24 14:48:21 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-24 14:48:21 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-24 14:48:21 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-24 14:48:21 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-24 14:48:21 [launcher.py:34] Route: /health, Methods: GET
INFO 04-24 14:48:21 [launcher.py:34] Route: /load, Methods: GET
INFO 04-24 14:48:21 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-24 14:48:21 [launcher.py:34] Route: /version, Methods: GET
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /score, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-24 14:48:21 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [246514]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 04-24 14:48:31 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

One line of the output reads:

INFO 04-24 14:48:21 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000

which tells you the IP address and port the server is listening on.

Running a model with vllm serve on specific GPUs

First run nvidia-smi to see which NVIDIA GPUs are installed on the host. As the screenshot below shows, this host has eight H20 GPUs in total; GPUs 0 and 1 are in use, leaving six idle.

image.png

Run the model with vLLM, pinning it to specific idle GPUs (the command below uses GPUs 2 through 5)

Note: without setting CUDA_VISIBLE_DEVICES, the run fails with an out-of-memory error, because vLLM loads the model starting from the lowest-numbered GPU and does not check whether a GPU is already occupied. Also make sure the port differs from that of any vLLM instance already running; the default port is 8000.
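The mechanism here is standard CUDA behavior: CUDA_VISIBLE_DEVICES remaps the listed physical GPUs to logical devices 0, 1, … for that process only, so vLLM never sees the busy ones. A minimal demonstration that the variable is scoped to the launched process:

```shell
# Physical GPUs 2 and 3 become logical devices 0 and 1 inside this process;
# the surrounding shell's environment is untouched.
CUDA_VISIBLE_DEVICES=2,3 python3 -c "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"   # prints: 2,3
```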

CUDA_VISIBLE_DEVICES=2,3,4,5 vllm serve Qwen/Qwen2.5-72B-Instruct --tensor-parallel-size 4 --api-key token-iamtornado888 --port 8001

The output is as follows:

CUDA_VISIBLE_DEVICES=2,3,4,5 vllm serve Qwen/Qwen2.5-72B-Instruct  --tensor-parallel-size 4 --api-key token-iamtornado888  --port 8001
INFO 04-27 17:08:39 [__init__.py:239] Automatically detected platform cuda.
INFO 04-27 17:08:40 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-27 17:08:40 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-72B-Instruct', config='', host=None, port=8001, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-iamtornado888', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-72B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=4, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, 
rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fd6c91dc900>)
INFO 04-27 17:08:47 [config.py:689] This model supports multiple tasks: {'reward', 'classify', 'embed', 'generate', 'score'}. Defaulting to 'generate'.
INFO 04-27 17:08:47 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-27 17:08:47 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-27 17:08:51 [__init__.py:239] Automatically detected platform cuda.
INFO 04-27 17:08:53 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='Qwen/Qwen2.5-72B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-72B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-72B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-27 17:08:53 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-27 17:08:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 10485760, 10, 'psm_34724d94'), local_subscribe_addr='ipc:///tmp/03333ed6-28f6-40ee-8fb0-98f0f5cf8032', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 17:08:55 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 17:08:58 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fd8513e8980>
(VllmWorker rank=0 pid=616882) INFO 04-27 17:08:58 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2b021bfc'), local_subscribe_addr='ipc:///tmp/e84d8350-6aa5-4028-85dc-409afbdacde3', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 17:09:00 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 17:09:02 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f13ee03ee10>
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:02 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5fff2aed'), local_subscribe_addr='ipc:///tmp/68f5878d-f445-4133-bf24-c3ad27e420c6', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 17:09:05 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 17:09:07 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f77dcd319d0>
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:07 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dd07ebc5'), local_subscribe_addr='ipc:///tmp/b1d56126-1bad-4d9a-a454-991f4bc85e9e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 17:09:09 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 17:09:12 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fb69d2219d0>
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:12 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2538b637'), local_subscribe_addr='ipc:///tmp/9bba8efc-f829-407d-893c-899a02e40165', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:12 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:12 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:12 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:12 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:12 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:14 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3,4,5.json
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:14 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3,4,5.json
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:14 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3,4,5.json
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:14 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_2,3,4,5.json
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:14 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_0d691ac6'), local_subscribe_addr='ipc:///tmp/e3bfcc03-b48d-457c-a219-9eba48507039', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:14 [parallel_state.py:959] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:14 [parallel_state.py:959] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:14 [parallel_state.py:959] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:14 [parallel_state.py:959] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:14 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:14 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:14 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:14 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:14 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-72B-Instruct...
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:14 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-72B-Instruct...
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:14 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-72B-Instruct...
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:14 [gpu_model_runner.py:1276] Starting to load model Qwen/Qwen2.5-72B-Instruct...
(VllmWorker rank=3 pid=617224) WARNING 04-27 17:09:14 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=616882) WARNING 04-27 17:09:14 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=617161) WARNING 04-27 17:09:14 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=616966) WARNING 04-27 17:09:14 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=617224) INFO 04-27 17:09:15 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=616966) INFO 04-27 17:09:15 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=616882) INFO 04-27 17:09:15 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=2 pid=617161) INFO 04-27 17:09:15 [weight_utils.py:265] Using model weights format ['*.safetensors']
/root/miniconda3/envs/myenv/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
model-00025-of-00037.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.00G/4.00G [00:27<00:00, 20.0MB/s]
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:01 [weight_utils.py:281] Time spent downloading weights for Qwen/Qwen2.5-72B-Instruct: 28.496084 seconds
model.safetensors.index.json: 79.0kB [00:00, 51.5MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/37 [00:00<?, ?it/s]
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:11 [loader.py:458] Loading weights took 9.89 seconds
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:11 [gpu_model_runner.py:1291] Model loading took 33.9835 GiB and 4916.997663 seconds
Loading safetensors checkpoint shards: 100% Completed | 37/37 [00:11<00:00,  3.30it/s]
(VllmWorker rank=0 pid=616882)
(VllmWorker rank=0 pid=616882) INFO 04-27 18:31:14 [loader.py:458] Loading weights took 11.25 seconds
(VllmWorker rank=1 pid=616966) INFO 04-27 18:31:14 [loader.py:458] Loading weights took 10.63 seconds
(VllmWorker rank=2 pid=617161) INFO 04-27 18:31:14 [loader.py:458] Loading weights took 11.85 seconds
(VllmWorker rank=0 pid=616882) INFO 04-27 18:31:14 [gpu_model_runner.py:1291] Model loading took 33.9835 GiB and 4919.544812 seconds
(VllmWorker rank=2 pid=617161) INFO 04-27 18:31:14 [gpu_model_runner.py:1291] Model loading took 33.9835 GiB and 4919.549431 seconds
(VllmWorker rank=1 pid=616966) INFO 04-27 18:31:14 [gpu_model_runner.py:1291] Model loading took 33.9835 GiB and 4919.549516 seconds
(VllmWorker rank=0 pid=616882) INFO 04-27 18:31:28 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/1223775e68/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=616882) INFO 04-27 18:31:28 [backends.py:426] Dynamo bytecode transform time: 14.58 s
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:29 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/1223775e68/rank_3_0 for vLLM's torch.compile
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:29 [backends.py:426] Dynamo bytecode transform time: 14.78 s
(VllmWorker rank=2 pid=617161) INFO 04-27 18:31:29 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/1223775e68/rank_2_0 for vLLM's torch.compile
(VllmWorker rank=2 pid=617161) INFO 04-27 18:31:29 [backends.py:426] Dynamo bytecode transform time: 14.83 s
(VllmWorker rank=1 pid=616966) INFO 04-27 18:31:29 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/1223775e68/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=616966) INFO 04-27 18:31:29 [backends.py:426] Dynamo bytecode transform time: 14.96 s
(VllmWorker rank=0 pid=616882) INFO 04-27 18:31:31 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=617224) INFO 04-27 18:31:32 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=2 pid=617161) INFO 04-27 18:31:32 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=1 pid=616966) INFO 04-27 18:31:32 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=0 pid=616882) INFO 04-27 18:32:18 [backends.py:144] Compiling a graph for general shape takes 48.49 s
(VllmWorker rank=3 pid=617224) INFO 04-27 18:32:19 [backends.py:144] Compiling a graph for general shape takes 49.27 s
(VllmWorker rank=2 pid=617161) INFO 04-27 18:32:19 [backends.py:144] Compiling a graph for general shape takes 49.25 s
(VllmWorker rank=1 pid=616966) INFO 04-27 18:32:19 [backends.py:144] Compiling a graph for general shape takes 49.85 s
(VllmWorker rank=2 pid=617161) INFO 04-27 18:32:46 [monitor.py:33] torch.compile takes 64.08 s in total
(VllmWorker rank=3 pid=617224) INFO 04-27 18:32:46 [monitor.py:33] torch.compile takes 64.05 s in total
(VllmWorker rank=1 pid=616966) INFO 04-27 18:32:46 [monitor.py:33] torch.compile takes 64.81 s in total
(VllmWorker rank=0 pid=616882) INFO 04-27 18:32:46 [monitor.py:33] torch.compile takes 63.06 s in total
INFO 04-27 18:32:48 [kv_cache_utils.py:634] GPU KV cache size: 601,072 tokens
INFO 04-27 18:32:48 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 18.34x
INFO 04-27 18:32:48 [kv_cache_utils.py:634] GPU KV cache size: 599,840 tokens
INFO 04-27 18:32:48 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 18.31x
INFO 04-27 18:32:48 [kv_cache_utils.py:634] GPU KV cache size: 599,840 tokens
INFO 04-27 18:32:48 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 18.31x
INFO 04-27 18:32:48 [kv_cache_utils.py:634] GPU KV cache size: 605,984 tokens
INFO 04-27 18:32:48 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 18.49x
(VllmWorker rank=0 pid=616882) INFO 04-27 18:33:26 [custom_all_reduce.py:195] Registering 10626 cuda graph addresses
(VllmWorker rank=3 pid=617224) INFO 04-27 18:33:26 [custom_all_reduce.py:195] Registering 10626 cuda graph addresses
(VllmWorker rank=2 pid=617161) INFO 04-27 18:33:26 [custom_all_reduce.py:195] Registering 10626 cuda graph addresses
(VllmWorker rank=1 pid=616966) INFO 04-27 18:33:26 [custom_all_reduce.py:195] Registering 10626 cuda graph addresses
(VllmWorker rank=3 pid=617224) INFO 04-27 18:33:27 [gpu_model_runner.py:1626] Graph capturing finished in 38 secs, took 1.34 GiB
(VllmWorker rank=1 pid=616966) INFO 04-27 18:33:27 [gpu_model_runner.py:1626] Graph capturing finished in 38 secs, took 1.34 GiB
(VllmWorker rank=2 pid=617161) INFO 04-27 18:33:27 [gpu_model_runner.py:1626] Graph capturing finished in 38 secs, took 1.34 GiB
(VllmWorker rank=0 pid=616882) INFO 04-27 18:33:27 [gpu_model_runner.py:1626] Graph capturing finished in 38 secs, took 1.34 GiB
INFO 04-27 18:33:27 [core.py:163] init engine (profile, create kv cache, warmup model) took 132.73 seconds
INFO 04-27 18:33:27 [core_client.py:435] Core engine process 0 ready.
WARNING 04-27 18:33:27 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-27 18:33:27 [serving_chat.py:118] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-27 18:33:27 [serving_completion.py:61] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
INFO 04-27 18:33:27 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8001
INFO 04-27 18:33:27 [launcher.py:26] Available routes are:
INFO 04-27 18:33:27 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-27 18:33:27 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-27 18:33:27 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-27 18:33:27 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-27 18:33:27 [launcher.py:34] Route: /health, Methods: GET
INFO 04-27 18:33:27 [launcher.py:34] Route: /load, Methods: GET
INFO 04-27 18:33:27 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-27 18:33:27 [launcher.py:34] Route: /version, Methods: GET
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /score, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-27 18:33:27 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [616168]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 04-27 18:33:37 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 18:39:36 [chat_utils.py:396] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 04-27 18:39:36 [logger.py:39] Received request chatcmpl-6a5b286d15de4dd9a73148ee2f99a2bc: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n将这段英文翻译为中文:China made significant strides in intellectual property development in 2024, ranking 11th in the Global Innovation Index and recording notable achievements in patent grants, trademark registrations, and copyright filings, according to a press conference Thursday.<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=32685, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 04-27 18:39:36 [async_llm.py:228] Added request chatcmpl-6a5b286d15de4dd9a73148ee2f99a2bc.
INFO 04-27 18:39:37 [loggers.py:87] Engine 000: Avg prompt throughput: 8.3 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO:     112.91.183.186:52487 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-27 18:39:47 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Verification:

Check the GPU status again with nvidia-smi.

Chat with the model via the API:

curl http://115.190.31.27:8001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer token-iamtornado888" \
    -d '{
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [
            {"role": "user", "content": "将这段英文翻译为中文:China made significant strides in intellectual property development in 2024, ranking 11th in the Global Innovation Index and recording notable achievements in patent grants, trademark registrations, and copyright filings, according to a press conference Thursday."}
        ]
    }' | jq .

Verification (common vLLM endpoints)

vLLM base URL

After the vllm serve command starts successfully, its base URL is http://your_server_ip:8000/v1 (port 8000 is the vLLM default; the server above was launched on port 8001 instead).

Query the model list with curl:

curl http://115.190.31.27:8000/v1/models | jq .
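
The same query can be issued from Python with only the standard library. A minimal sketch (the base URL follows the curl example above; `models_url` and `parse_model_ids` are illustrative helper names, not part of vLLM):

```python
import json
import urllib.request


def models_url(base_url: str) -> str:
    # Join the vLLM base URL (".../v1") with the models endpoint.
    return base_url.rstrip("/") + "/models"


def parse_model_ids(body: str) -> list:
    # A /v1/models response looks like {"object": "list", "data": [{"id": ...}, ...]}.
    return [m["id"] for m in json.loads(body)["data"]]


def main() -> None:
    # Live query; assumes the server started above is still reachable.
    with urllib.request.urlopen(models_url("http://115.190.31.27:8000/v1")) as resp:
        print(parse_model_ids(resp.read().decode("utf-8")))

# Call main() against a running server; it is not invoked here.
```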

Chat with the model via curl:

curl http://115.190.31.27:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "who are you?"}
        ]
    }' | jq .

Chat with the model using the openai Python package

Because vLLM exposes an OpenAI-compatible API, the openai library can be used to talk to the Qwen model directly.
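
A minimal sketch of such a client, assuming the openai package (pip install openai, v1.x) and the api-key server started in the next step; `build_messages` and `chat` are illustrative names, not vLLM APIs:

```python
def build_messages(user_prompt: str) -> list:
    # Standard OpenAI-style chat payload.
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_prompt},
    ]


def chat(base_url: str, api_key: str, model: str, user_prompt: str) -> str:
    # Imported lazily so the helper above also works without the package installed.
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=base_url, api_key=api_key)
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(user_prompt),
    )
    return resp.choices[0].message.content

# Example call against the api-key server below (not invoked here):
# chat("http://115.190.31.27:8000/v1", "token-iamtornado888",
#      "Qwen/QwQ-32B", "who are you?")
```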

Run the model with vLLM, adding an API key:

vllm serve Qwen/QwQ-32B  --tensor-parallel-size 8 --api-key token-iamtornado888

The output is as follows:

vllm serve Qwen/QwQ-32B  --tensor-parallel-size 8
INFO 04-27 08:52:34 [__init__.py:239] Automatically detected platform cuda.
INFO 04-27 08:52:35 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-27 08:52:35 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='Qwen/QwQ-32B', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/QwQ-32B', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, 
hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f76b1764900>)
INFO 04-27 08:52:42 [config.py:689] This model supports multiple tasks: {'classify', 'embed', 'generate', 'score', 'reward'}. Defaulting to 'generate'.
INFO 04-27 08:52:42 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-27 08:52:42 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-27 08:52:46 [__init__.py:239] Automatically detected platform cuda.
INFO 04-27 08:52:48 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='Qwen/QwQ-32B', speculative_config=None, tokenizer='Qwen/QwQ-32B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/QwQ-32B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-27 08:52:48 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-27 08:52:48 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 10485760, 10, 'psm_b715c377'), local_subscribe_addr='ipc:///tmp/a12fcab2-4aca-4a91-ab0f-0bffc83f0c26', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:52:50 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:52:53 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fc08f0df500>
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:52:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_dd4defaa'), local_subscribe_addr='ipc:///tmp/a87a6371-cabb-4202-8d54-8167799974aa', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:52:55 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:52:57 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fcf4eb5c680>
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:52:57 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f73189ef'), local_subscribe_addr='ipc:///tmp/83e03a78-ebc8-4549-8153-c5645d096eaf', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:00 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:02 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fe7793eb710>
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:02 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_93f29e51'), local_subscribe_addr='ipc:///tmp/6a8ad3c7-394d-4d79-bdcb-1a5642a8952e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:05 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:07 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f8903fc4740>
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:07 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f77d7284'), local_subscribe_addr='ipc:///tmp/646f368e-039e-4e9a-a92f-6f570d4fc0d1', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:09 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:11 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f23462beb10>
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:11 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1d274286'), local_subscribe_addr='ipc:///tmp/f445b8d5-4cca-408d-a5f1-a98cd0ed1e18', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:14 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:16 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f46fc9ca630>
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:16 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2560b976'), local_subscribe_addr='ipc:///tmp/29c712f9-30a6-444c-9152-5aec2f0f4533', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:19 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:21 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fe40812e510>
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:21 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_5560e3b0'), local_subscribe_addr='ipc:///tmp/555e07d8-48a8-4d8c-8042-1eabb648f60c', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-27 08:53:23 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-27 08:53:25 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f4c70f84920>
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:25 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_9e5a8b9c'), local_subscribe_addr='ipc:///tmp/8bf5fa68-4010-4f28-a35d-84caa498833a', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:26 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:26 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:32 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:55 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3,4,5,6,7.json
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:55 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_19b3c8f8'), local_subscribe_addr='ipc:///tmp/b2bc59f7-c45b-4761-b551-5313e5e3aaa3', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:55 [parallel_state.py:959] rank 7 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 7
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:55 [parallel_state.py:959] rank 5 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 5
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:55 [parallel_state.py:959] rank 4 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 4
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:55 [parallel_state.py:959] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 3
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:55 [parallel_state.py:959] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 2
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:55 [parallel_state.py:959] rank 6 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 6
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:55 [parallel_state.py:959] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:55 [parallel_state.py:959] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:55 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:55 [gpu_model_runner.py:1276] Starting to load model Qwen/QwQ-32B...
(VllmWorker rank=7 pid=4161915) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=5 pid=4161796) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=4 pid=4161735) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=3 pid=4161678) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=2 pid=4161609) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=4161236) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=6 pid=4161859) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=4161367) WARNING 04-27 08:53:55 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:53:56 [weight_utils.py:265] Using model weights format ['*.safetensors']
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:53:59 [weight_utils.py:281] Time spent downloading weights for Qwen/QwQ-32B: 0.654292 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:02,  5.56it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:02,  5.04it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:00<00:02,  5.22it/s]
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:53:59 [weight_utils.py:281] Time spent downloading weights for Qwen/QwQ-32B: 0.606331 seconds
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:00<00:02,  4.87it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:01<00:01,  4.63it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:01<00:01,  4.57it/s]
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:00 [loader.py:458] Loading weights took 3.54 seconds
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:00 [loader.py:458] Loading weights took 3.09 seconds
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:00 [loader.py:458] Loading weights took 2.82 seconds
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:01<00:01,  4.55it/s]
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:00 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 4.778883 seconds
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:00 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 4.772099 seconds
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:00 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 4.962865 seconds
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:01<00:01,  4.58it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:01<00:01,  4.55it/s]
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:01 [loader.py:458] Loading weights took 2.83 seconds
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:02<00:00,  4.49it/s]
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:01 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 5.428892 seconds
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:02<00:00,  4.43it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:02<00:00,  4.44it/s]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:02<00:00,  4.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:02<00:00,  4.80it/s]
(VllmWorker rank=0 pid=4161236)
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:02 [loader.py:458] Loading weights took 2.93 seconds
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:02 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 6.429508 seconds
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:02 [loader.py:458] Loading weights took 2.82 seconds
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:03 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 7.178118 seconds
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:03 [loader.py:458] Loading weights took 2.94 seconds
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:03 [loader.py:458] Loading weights took 2.72 seconds
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:03 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 7.759082 seconds
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:03 [gpu_model_runner.py:1291] Model loading took 7.6878 GiB and 8.016376 seconds
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_7_0 for vLLM's torch.compile
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_5_0 for vLLM's torch.compile
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.73 s
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.73 s
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_4_0 for vLLM's torch.compile
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.74 s
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.83 s
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_6_0 for vLLM's torch.compile
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.86 s
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.91 s
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_2_0 for vLLM's torch.compile
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.91 s
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:15 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/f23075ea50/rank_3_0 for vLLM's torch.compile
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:15 [backends.py:426] Dynamo bytecode transform time: 11.99 s
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:18 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:19 [backends.py:132] Cache the graph of shape None for later use
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 39.80 s
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 39.66 s
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 40.14 s
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 40.16 s
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 40.16 s
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 40.14 s
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:54:56 [backends.py:144] Compiling a graph for general shape takes 40.28 s
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:54:57 [backends.py:144] Compiling a graph for general shape takes 40.29 s
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 51.89 s in total
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 51.53 s in total
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 52.00 s in total
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 51.88 s in total
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 51.99 s in total
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 51.65 s in total
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 52.18 s in total
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:55:17 [monitor.py:33] torch.compile takes 52.19 s in total
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,321,808 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.68x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,318,736 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.61x
INFO 04-27 08:55:22 [kv_cache_utils.py:634] GPU KV cache size: 2,334,096 tokens
INFO 04-27 08:55:22 [kv_cache_utils.py:637] Maximum concurrency for 40,960 tokens per request: 56.98x
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:55:50 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:55:51 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:55:53 [custom_all_reduce.py:195] Registering 8643 cuda graph addresses
(VllmWorker rank=7 pid=4161915) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=1 pid=4161367) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=5 pid=4161796) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=0 pid=4161236) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=2 pid=4161609) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=3 pid=4161678) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=4 pid=4161735) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
(VllmWorker rank=6 pid=4161859) INFO 04-27 08:55:53 [gpu_model_runner.py:1626] Graph capturing finished in 31 secs, took 1.04 GiB
INFO 04-27 08:55:54 [core.py:163] init engine (profile, create kv cache, warmup model) took 110.08 seconds
INFO 04-27 08:55:54 [core_client.py:435] Core engine process 0 ready.
WARNING 04-27 08:55:54 [config.py:1177] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 04-27 08:55:54 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 40, 'top_p': 0.95}
INFO 04-27 08:55:54 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 40, 'top_p': 0.95}
INFO 04-27 08:55:54 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-27 08:55:54 [launcher.py:26] Available routes are:
INFO 04-27 08:55:54 [launcher.py:34] Route: /openapi.json, Methods: HEAD, GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /docs, Methods: HEAD, GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /redoc, Methods: HEAD, GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /health, Methods: GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /load, Methods: GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /ping, Methods: POST, GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /version, Methods: GET
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /score, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-27 08:55:54 [launcher.py:34] Route: /invocations, Methods: POST
INFO:     Started server process [4160597]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO 04-27 08:56:04 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:56:14 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:56:24 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:56:34 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:56:44 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:56:54 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:57:04 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 04-27 08:57:14 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Verification:

from openai import OpenAI

# Configure the API key and endpoint
client = OpenAI(
    api_key="token-iamtornado888",
    base_url="http://115.190.31.27:8000/v1"
)

# Send a chat completion request
chat_response = client.chat.completions.create(
    model="Qwen/QwQ-32B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "who are you?"},
    ]
)

# Print the key information in a structured form
output = f"""
🤖 Model response analysis:
——————————————————
📄 Reply content:
{chat_response.choices[0].message.content}

🔍 Metadata:
  Model name: {chat_response.model}
  Response ID: {chat_response.id}
  Token usage:
    - Input: {chat_response.usage.prompt_tokens} tokens
    - Output: {chat_response.usage.completion_tokens} tokens
    - Total: {chat_response.usage.total_tokens} tokens
🕒 Created at: {chat_response.created} (Unix timestamp)
——————————————————
"""

print(output)

The output is as follows:

🤖 Model response analysis:
——————————————————
📄 Reply content:
Okay, the user is asking "who are you?" Let me start by recalling the information from the initial prompt. I need to present myself as a helpful assistant created by Alibaba Cloud, specifically the Tongyi Lab. I should mention my name, Qwen, and my purpose.

Wait, the user might want a concise answer first. Let me check the example response they provided. The example starts with "I am Qwen, a large language model developed by Alibaba Cloud's Tongyi Lab." That's a good structure. I should follow that but maybe add a bit more detail to be thorough.

Hmm, should I mention my capabilities? The example includes answering questions, providing information, assisting with tasks, etc. Yes, that's important. Also, maybe note that I can help with various topics. But keep it brief.

I need to make sure I don't go into too much technical detail. The user might be looking for a straightforward introduction. Let me structure it like the example but ensure clarity and completeness. Alright, let's put it all together.
</think>

I am Qwen, a large language model developed by Alibaba Cloud's Tongyi Lab. I'm designed to assist users by answering questions, providing information, and helping with tasks across a wide range of topics. Whether you need help with technology, creative writing, general knowledge, or more, I'm here to support you! How can I assist you today?

🔍 Metadata:
  Model name: Qwen/QwQ-32B
  Response ID: chatcmpl-4102287deb5e46828f064f979c912e99
  Token usage:
    - Input: 25 tokens
    - Output: 288 tokens
    - Total: 313 tokens
🕒 Created at: 1745716276 (Unix timestamp)
——————————————————
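Besides the blocking call above, the OpenAI-compatible endpoint also supports streaming: with `"stream": true` each chunk arrives as a server-sent-events `data:` line carrying a partial `delta`. As a minimal stdlib-only sketch (the sample payloads below are illustrative, hand-written in the OpenAI streaming chunk shape, not captured from a live server), this is how such lines can be reassembled into the full reply:

```python
import json

def iter_stream_content(lines):
    """Yield text deltas from OpenAI-style SSE lines ('data: {...}')."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data: "):]
        if payload == "[DONE]":  # sentinel that terminates the stream
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:  # the first chunk often carries only the role, no content
            yield delta

# Illustrative SSE lines in the OpenAI streaming format:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # prints "Hello world"
```

With the official SDK the same effect is achieved by passing `stream=True` to `client.chat.completions.create(...)` and iterating over the returned chunks.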

Troubleshooting

"error": "Unauthorized"

This happens because the server was started with the api-key argument, so an API key is required to call the API. Simply attach the credential, for example:

curl -H "Authorization: Bearer token-iamtornado888" http://115.190.31.27:8001/v1/models | jq .
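The same rule applies on the Python side: the `api_key` passed to the OpenAI client must match the `--api-key` the server was launched with, since the SDK sends it as a Bearer token, which is exactly what the server validates. A trivial sketch of the header being checked (token value taken from the example above):

```python
def auth_header(token: str) -> dict:
    """Build the Authorization header the OpenAI-compatible server expects."""
    return {"Authorization": f"Bearer {token}"}

# Must match the --api-key value the vLLM server was launched with:
print(auth_header("token-iamtornado888"))
```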


ValueError: Unrecognized model in Qwen/Qwen2.5-72B-Instruct. Should have a `model_type` key in its config.json, or contain one of the following strings in its name

The detailed output is below. The log claims that config.json contains no acceptable model_type value, yet inspecting the repository on Hugging Face shows the key is present and its value is valid.

CUDA_VISIBLE_DEVICES=2,3,4,5 vllm serve Qwen/Qwen2.5-72B-Instruct  --tensor-parallel-size 4 --api-key token-iamtornado888  --port 8001
vllm serve Qwen/Qwen2.5-72B-Instruct  --tensor-parallel-size 6 --api-key token-iamtornado888  --port 8001
INFO 04-27 16:21:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-27 16:21:57 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-27 16:21:57 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='Qwen/Qwen2.5-72B-Instruct', config='', host=None, port=8001, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='token-iamtornado888', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-72B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=None, guided_decoding_backend='auto', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=6, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, 
rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f24a99dc900>)
Traceback (most recent call last):
  File "/root/miniconda3/envs/myenv/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
    args.dispatch_function(args)
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1154, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1042, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/config.py", line 423, in __init__
    hf_config = get_config(self.hf_config_path or self.model,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 327, in get_config
    raise e
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 307, in get_config
    config = AutoConfig.from_pretrained(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/myenv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1151, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in Qwen/Qwen2.5-72B-Instruct. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, aria, aria_text, audio-spectrogram-transformer, autoformer, aya_vision, bamba, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, cohere2, colpali, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dab-detr, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deepseek_v3, deformable_detr, deit, depth_anything, depth_pro, deta, detr, diffllama, dinat, dinov2, dinov2_with_registers, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, emu3, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, gemma3, gemma3_text, git, glm, glm4, glpn, got_ocr2, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, granitemoeshared, granitevision, graphormer, grounding-dino, groupvit, helium, hiera, hubert, ibert, idefics, idefics2, idefics3, idefics3_vision, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llama4, llama4_text, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mistral3, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, modernbert, moonshine, moshi, mpnet, mpt, mra, 
mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phi4_multimodal, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prompt_depth_anything, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_5_vl, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, qwen3, qwen3_moe, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rt_detr_v2, rwkv, sam, sam_vision_model, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, shieldgemma2, siglip, siglip2, siglip_vision_model, smolvlm, smolvlm_vision, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superglue, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, textnet, time_series_transformer, timesformer, timm_backbone, timm_wrapper, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vitpose, vitpose_backbone, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zamba2, zoedepth
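The ValueError above comes from transformers failing to find a `model_type` key when resolving the model's config.json. A minimal sketch of that check (illustrative only, not the actual transformers implementation):

```python
import json

# Mirror of the check that triggers "Unrecognized model": a Hugging Face
# config.json must carry a `model_type` key (e.g. "qwen2" for Qwen2.5 models).
def check_model_type(config_text: str) -> str:
    cfg = json.loads(config_text)
    if "model_type" not in cfg:
        raise ValueError("Unrecognized model: config.json has no `model_type` key")
    return cfg["model_type"]

print(check_model_type('{"model_type": "qwen2", "hidden_size": 8192}'))  # prints: qwen2
```

In this case the key was not genuinely missing from Qwen's config; the error more likely reflected a broken download or cache state, as explained below.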


I then tried several other models and all of them failed with the same error, so the problem was not a lack of model support but some other configuration. It turned out that before launching the model I had set an environment variable (shown below) intended to speed up model file downloads; once I removed it, the problem went away.

export HF_HUB_ENABLE_HF_TRANSFER=1

The variable can be removed with the following command:

unset HF_HUB_ENABLE_HF_TRANSFER
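In script form, the workaround looks like this. Whether hf_transfer actually conflicts with model loading may depend on the vllm/huggingface_hub versions involved; this is simply what resolved the issue here:

```shell
# Simulate the setting that broke model loading in this case
export HF_HUB_ENABLE_HF_TRANSFER=1

# Remove it before launching vLLM
unset HF_HUB_ENABLE_HF_TRANSFER

# Confirm the variable is really gone ("+x" distinguishes unset from empty)
[ -z "${HF_HUB_ENABLE_HF_TRANSFER+x}" ] && echo "HF_HUB_ENABLE_HF_TRANSFER is unset"
```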

Failed to initialize NVML: Unknown Error

Running the container that verifies the NVIDIA Container Toolkit installation produced an error:

sudo docker run -it --rm --gpus all ubuntu nvidia-smi
Failed to initialize NVML: Unknown Error

I later found a solution online; original link: Nvida Container Toolkit: Failed to initialize NVML: Unknown Error - Graphics / Linux / Linux - NVIDIA Developer Forums

The fix is simple: edit the file /etc/nvidia-container-runtime/config.toml.

Set no-cgroups to false, then restart Docker: sudo systemctl restart docker
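These two steps can be scripted; the sketch below performs the edit on a temporary copy so the real file is untouched, and the real commands (which need sudo and assume the default config path) are shown in comments:

```shell
# Demonstrate the edit on a stand-in file rather than the live config
tmp=$(mktemp)
printf '[nvidia-container-cli]\nno-cgroups = true\n' > "$tmp"

# Flip the flag that blocks NVML initialization in containers
sed -i 's/^no-cgroups = true$/no-cgroups = false/' "$tmp"
grep '^no-cgroups' "$tmp"    # prints: no-cgroups = false

# On the real system:
#   sudo sed -i 's/^no-cgroups = true$/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml
#   sudo systemctl restart docker
```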

The final config.toml looks like this:

#accept-nvidia-visible-devices-as-volume-mounts = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

Running the verification container again worked correctly.


About the Author and DreamAI

https://docs.dingtalk.com/i/nodes/Amq4vjg890AlRbA6Td9ZvlpDJ3kdP0wQ?iframeQuery=utm_source=portal&utm_medium=portal_recent

Follow the WeChat official account "AI发烧友" for more practical IT development and operations tools and tips, plus many AI technical documents!


posted @ 2025-08-28 15:58  iamtornado