LLaMA-Factory: Running a Full 4-bit Fine-Tune of Llama-3-8B-Instruct on Dual NVIDIA Quadro P5000 16 GB GPUs

Preface

Because the P5000 has limited VRAM, we use QLoRA (4-bit quantization) for memory-efficient fine-tuning and to avoid OOM (out-of-memory) errors.

Download the Model

https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files

Use modelscope to download the model. Install the client in your Python environment first: pip install modelscope

Start the download:

modelscope download --model LLM-Research/Meta-Llama-3-8B-Instruct --local_dir /root/niuben/Meta-Llama-3-8B-Instruct
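
To confirm the download completed, list the target directory (the path is the --local_dir used above; the exact shard names vary by model revision):

ls -lh /root/niuben/Meta-Llama-3-8B-Instruct
# expect config.json, the tokenizer files, and several *.safetensors shards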

Download the LLaMA-Factory GPU Docker Image

LLaMA-Factory provides an official pre-built GPU image, hiyouga/llamafactory:latest, based on CUDA 12.4, PyTorch 2.6.0 and Flash-attn 2.7.4, and it works on the Quadro P5000.

docker pull hiyouga/llamafactory:latest
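
Optionally, before the full interactive run in the next section, you can confirm that GPU passthrough into containers works with the freshly pulled image (this assumes the NVIDIA Container Toolkit is installed on the host, which injects nvidia-smi into the container):

docker run --rm --gpus all hiyouga/llamafactory:latest nvidia-smi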

1. Hardware Environment

Wed Dec 10 10:22:22 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09    Driver Version: 580.82.09    CUDA Version: 13.0             |
|-------------------------------+----------------------+------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P5000       Off  | 00000000:00:06.0 Off |                  Off |
| 26%   37C    P8     6W / 180W |    109MiB / 16384MiB |      0%      Default |
|   1  Quadro P5000       Off  | 00000000:00:07.0 Off |                  Off |
| 26%   32C    P8     6W / 180W |    109MiB / 16384MiB |      0%      Default |
+---------------------------------------------------------------------------------------+

2. Start the LLaMA-Factory GPU Image

Mount the model directory and enable GPU passthrough. (Note: with --network=host the -p port mappings are ignored; also, /output is not mounted here and the container is started with --rm, so copy or export anything you need out of /output before you exit.)

docker run -it --rm \
  --gpus=all \
  --ipc=host \
  --network=host \
  -v /root/niuben:/models/llama3 \
  -p 7860:7860 -p 8000:8000 \
  --name llamafactory-finetune \
  hiyouga/llamafactory:latest \
  /bin/bash

Once inside the container (bash), verify the environment:

nvidia-smi  # confirm both P5000s are detected
python -c "import torch; print(torch.cuda.is_available())"  # should print True
pip list | grep -E "torch|transformers|peft"  # check the core libraries (pre-installed)
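
If you also want to confirm that PyTorch sees both cards (an extra illustrative check, not part of the official instructions):

python -c "import torch; print(torch.cuda.device_count(), torch.cuda.get_device_name(0))"  # expect 2 and 'Quadro P5000'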

3. Critical Patch (the latest image is missing bitsandbytes)

pip install "bitsandbytes>=0.39.0" -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
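
A quick sanity check that the freshly installed library imports cleanly inside the container (an illustrative check only):

python -c "import bitsandbytes as bnb; print(bnb.__version__)"  # should print 0.39.0 or newer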

4. Start the QLoRA Fine-Tune

The --dataset argument here uses the built-in alpaca_en_demo dataset; you can also point it at your own dataset (a sketch of registering a custom dataset follows the list below):

  • alpaca_en_demo # a trimmed-down version of the classic 52K English Alpaca set, only 1,000 samples, meant as a practice run for small GPUs; it finishes in tens of minutes
  • identity_zh # switch to this if you want Chinese fine-tuning (500 high-quality Chinese identity-alignment samples)
  • sharegpt_zh # or this one (20K+ Chinese ShareGPT samples; usable if you have enough VRAM)
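
If you would rather train on your own data, LLaMA-Factory looks up dataset names in data/dataset_info.json. A minimal sketch of registering a custom Alpaca-style dataset; the names my_data.json and my_dataset are placeholders, not built-ins:

# 1) drop an Alpaca-format JSON file into LLaMA-Factory's data/ directory
cat > data/my_data.json <<'EOF'
[
  {
    "instruction": "Introduce yourself.",
    "input": "",
    "output": "I am an assistant fine-tuned from Llama-3-8B-Instruct."
  }
]
EOF
# 2) register it in data/dataset_info.json by adding:  "my_dataset": {"file_name": "my_data.json"}
# 3) then pass --dataset my_dataset to llamafactory-cli train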

--quantization_bit 4 enables 4-bit quantization.

llamafactory-cli train \
  --stage sft \
  --do_train \
  --model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
  --template llama3 \
  --finetuning_type lora \
  --lora_target q_proj,v_proj \
  --quantization_bit 4 \
  --dataset alpaca_en_demo \
  --cutoff_len 1024 \
  --output_dir /output/llama3-qlora-p5000 \
  --logging_steps 5 \
  --save_steps 200 \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 4 \
  --num_train_epochs 3 \
  --learning_rate 1e-4 \
  --bf16 \
  --gradient_checkpointing \
  --overwrite_output_dir
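
For reference, these flags imply an effective batch size of per_device_train_batch_size 2 × gradient_accumulation_steps 4 = 8 per GPU per optimizer step; if LLaMA-Factory launches data-parallel training across both P5000s (recent versions do this automatically when more than one GPU is visible), the global effective batch size is 2 × 4 × 2 = 16.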

Training: the tokenizer is downloaded automatically if it is missing, then SFT starts; the log shows the loss curve.
Monitoring: use watch -n 0.1 nvidia-smi to observe GPU utilization.

Fine-tuning succeeded; the final output:

[INFO|trainer.py:4309] 2025-12-10 07:32:34,710 >> Saving model checkpoint to /output/llama3-qlora-p5000
[INFO|configuration_utils.py:763] 2025-12-10 07:32:34,777 >> loading configuration file /models/llama3/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:839] 2025-12-10 07:32:34,778 >> Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 128009,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|tokenization_utils_base.py:2421] 2025-12-10 07:32:34,816 >> chat template saved in /output/llama3-qlora-p5000/chat_template.jinja
[INFO|tokenization_utils_base.py:2590] 2025-12-10 07:32:34,820 >> tokenizer config file saved in /output/llama3-qlora-p5000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2599] 2025-12-10 07:32:34,820 >> Special tokens file saved in /output/llama3-qlora-p5000/special_tokens_map.json
***** train metrics *****
  epoch                    =        3.0
  total_flos               = 31421410GF
  train_loss               =     1.0571
  train_runtime            = 1:47:46.01
  train_samples_per_second =      0.464
  train_steps_per_second   =      0.029
Figure saved at: /output/llama3-qlora-p5000/training_loss.png
[WARNING|2025-12-10 07:32:35] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-12-10 07:32:35] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:456] 2025-12-10 07:32:35,317 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
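
At this point the output directory holds only the LoRA adapter, not full model weights. A quick look (file names follow current PEFT/LLaMA-Factory conventions and may vary by version):

ls /output/llama3-qlora-p5000
# expect adapter_config.json and adapter_model.safetensors, plus tokenizer files,
# trainer_log.jsonl and the training_loss.png referenced above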

5. After Fine-Tuning: Load the LoRA Adapter and Chat Directly, Without Merging

CLI chat

llamafactory-cli chat \
  --model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
  --adapter_name_or_path /output/llama3-qlora-p5000 \
  --template llama3 \
  --finetuning_type lora \
  --quantization_bit 4

Or use the Web UI: llamafactory-cli webui
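
The docker run above also maps port 8000, which is the default port of LLaMA-Factory's OpenAI-compatible API server. A sketch of serving the same adapter over that API (same arguments as the chat command):

llamafactory-cli api \
  --model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
  --adapter_name_or_path /output/llama3-qlora-p5000 \
  --template llama3 \
  --finetuning_type lora \
  --quantization_bit 4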

6. Export a Full Deployable Model (Optional)

llamafactory-cli export \
  --model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
  --adapter_name_or_path /output/llama3-qlora-p5000 \
  --template llama3 \
  --finetuning_type lora \
  --export_dir /output/merged-llama3-8b-instruct-cn \
  --export_size 2 \
  --export_device cpu
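
Two flags worth noting: --export_size 2 caps each exported weight shard at roughly 2 GB, and --export_device cpu performs the LoRA merge on the CPU, so the 16 GB cards do not run out of memory during the merge.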

After exporting, you can chat with the command below (the original base model is no longer needed):

llamafactory-cli chat \
  --model_name_or_path /output/merged-llama3-8b-instruct-cn \
  --template llama3 \
  --quantization_bit 4