LLaMA-Factory: A Complete Walkthrough of 4-bit Fine-Tuning Llama-3-8B-Instruct on Dual 16 GB NVIDIA Quadro P5000 GPUs
Preface
Because the P5000's 16 GB of VRAM is limited, we use QLoRA (4-bit quantization) for memory-efficient fine-tuning and to avoid OOM (out-of-memory) errors.
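For context, the --quantization_bit 4 flag used in the training command later tells LLaMA-Factory to quantize the frozen base weights with bitsandbytes. A minimal hand-rolled equivalent in plain transformers looks roughly like this; the NF4 settings below are typical QLoRA defaults, not something this tutorial configures explicitly:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Typical QLoRA-style 4-bit config: NF4 quantization with double quantization.
# float16 compute is chosen because Pascal cards lack native bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "/root/niuben/Meta-Llama-3-8B-Instruct",  # host-side path from the download step below
    quantization_config=bnb_config,
    device_map="auto",  # shards the quantized weights across both P5000s
)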
Download the Model
https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files
Download the model with modelscope; install it into your Python environment with pip install modelscope.
Start the download:
modelscope download --model LLM-Research/Meta-Llama-3-8B-Instruct --local_dir /root/niuben/Meta-Llama-3-8B-Instruct
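If you prefer the Python API over the CLI, the same download looks roughly like this (a sketch, assuming a recent modelscope release that accepts local_dir):
from modelscope import snapshot_download

# Downloads the full model repository; local_dir support assumes a recent modelscope.
model_dir = snapshot_download(
    "LLM-Research/Meta-Llama-3-8B-Instruct",
    local_dir="/root/niuben/Meta-Llama-3-8B-Instruct",
)
print(model_dir)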
Pull the LLaMA-Factory GPU Docker Image
LLaMA-Factory publishes a prebuilt GPU image, hiyouga/llamafactory:latest, based on CUDA 12.4, PyTorch 2.6.0, and flash-attn 2.7.4; it runs on the Quadro P5000 (though note the flash-attn kernels themselves require Ampere-or-newer GPUs, so they simply go unused on Pascal cards).
docker pull hiyouga/llamafactory:latest
1. Hardware Environment
Wed Dec 10 10:22:22 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.09              Driver Version: 580.82.09      CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro P5000               Off     |   00000000:00:06.0 Off |                  Off |
| 26%   37C    P8             6W /  180W  |    109MiB / 16384MiB   |      0%      Default |
|   1  Quadro P5000               Off     |   00000000:00:07.0 Off |                  Off |
| 26%   32C    P8             6W /  180W  |    109MiB / 16384MiB   |      0%      Default |
+-----------------------------------------+------------------------+----------------------+
2. Start the LLaMA-Factory GPU Container
(Mount the model, dataset, and output directories, with GPU passthrough enabled. Two notes: with --network=host the container shares the host network stack, so no -p port mappings are needed and the Web UI (7860) and API (8000) ports are reachable directly; and because the container runs with --rm, /output must be mounted to a host path (here /root/niuben/output; adjust to taste), or the fine-tuned checkpoints vanish when the container exits.)
docker run -it --rm \
--gpus=all \
--ipc=host \
--network=host \
-v /root/niuben:/models/llama3 \
-v /root/niuben/output:/output \
--name llamafactory-finetune \
hiyouga/llamafactory:latest \
/bin/bash
Once inside the container shell, verify the environment:
nvidia-smi # confirm both P5000 cards are detected
python -c "import torch; print(torch.cuda.is_available())" # should print True
pip list | grep -E "torch|transformers|peft" # check the core libraries (preinstalled)
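Optionally, a slightly deeper probe (my own addition, not shipped with the image) reports per-GPU details, including whether bfloat16 is natively supported, which matters for the --bf16 flag used later:
# env_check.py
import torch

print(torch.__version__, torch.version.cuda)
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(i, p.name, f"{p.total_memory / 2**30:.1f} GiB", f"sm_{p.major}{p.minor}")
# The P5000 is Pascal (sm_61): no native bfloat16 hardware support.
print("bf16 supported:", torch.cuda.is_bf16_supported())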
3. Critical Patch (the latest image is missing bitsandbytes)
pip install "bitsandbytes>=0.39.0" -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
(The version spec must be quoted so the shell does not treat >= as a redirect.)
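bitsandbytes ships a self-diagnostic entry point; run it to confirm the library found the CUDA runtime:
python -m bitsandbytes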
4. Start QLoRA Fine-Tuning
The --dataset flag here uses the built-in alpaca_en_demo dataset; you can also point it at a dataset of your own (see the sketch after this list). Built-in options include:
- alpaca_en_demo # a 1,000-sample slice of the classic 52K English Alpaca set, made for practicing on low-VRAM cards; this run finished in just under two hours (see the metrics below)
- identity_zh # switch to this for Chinese fine-tuning (500 high-quality Chinese identity-alignment samples)
- sharegpt_zh # or this one (20K+ Chinese ShareGPT samples, if your VRAM allows)
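For your own data, LLaMA-Factory's alpaca-style SFT format is a JSON array of instruction/input/output records, registered under a name of your choosing in data/dataset_info.json inside the container. A minimal sketch (my_dataset is a hypothetical name; the built-in datasets above are already registered):
data/my_dataset.json:
[
  {
    "instruction": "Translate the input to French.",
    "input": "Good morning",
    "output": "Bonjour"
  }
]
Entry added to data/dataset_info.json:
"my_dataset": { "file_name": "my_dataset.json" }
Then pass --dataset my_dataset to the training command.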
--quantization_bit 4 enables 4-bit (QLoRA) quantization. One caveat: Pascal GPUs such as the P5000 have no native bfloat16 support, so if the --bf16 flag below errors out or trains abnormally slowly on your machine, swap it for --fp16.
llamafactory-cli train \
--stage sft \
--do_train \
--model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
--template llama3 \
--finetuning_type lora \
--lora_target q_proj,v_proj \
--quantization_bit 4 \
--dataset alpaca_en_demo \
--cutoff_len 1024 \
--output_dir /output/llama3-qlora-p5000 \
--logging_steps 5 \
--save_steps 200 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--num_train_epochs 3 \
--learning_rate 1e-4 \
--bf16 \
--gradient_checkpointing \
--overwrite_output_dir
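A quick sanity check on what these flags imply for step count; the numbers line up with the metrics printed at the end of the run below (~0.029 steps/s over ~1 h 48 min is about 187 steps):
# Back-of-the-envelope step count for this run, from the flags above.
gpus, per_device_bs, grad_accum = 2, 2, 4
samples, epochs = 1000, 3                           # alpaca_en_demo is ~1,000 samples
effective_bs = gpus * per_device_bs * grad_accum    # 16 samples per optimizer step
steps = samples * epochs / effective_bs             # ~187 optimizer steps in total
print(effective_bs, round(steps))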
Training: the tokenizer is downloaded automatically if missing, then SFT begins; the logs show the loss curve.
Monitoring: use watch -n 0.1 nvidia-smi to watch GPU utilization.
Fine-tuning succeeded; the tail of the log:
[INFO|trainer.py:4309] 2025-12-10 07:32:34,710 >> Saving model checkpoint to /output/llama3-qlora-p5000
[INFO|configuration_utils.py:763] 2025-12-10 07:32:34,777 >> loading configuration file /models/llama3/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:839] 2025-12-10 07:32:34,778 >> Model config LlamaConfig {
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"dtype": "bfloat16",
"eos_token_id": 128009,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 8192,
"mlp_bias": false,
"model_type": "llama",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"pretraining_tp": 1,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 500000.0,
"tie_word_embeddings": false,
"transformers_version": "4.57.1",
"use_cache": true,
"vocab_size": 128256
}
[INFO|tokenization_utils_base.py:2421] 2025-12-10 07:32:34,816 >> chat template saved in /output/llama3-qlora-p5000/chat_template.jinja
[INFO|tokenization_utils_base.py:2590] 2025-12-10 07:32:34,820 >> tokenizer config file saved in /output/llama3-qlora-p5000/tokenizer_config.json
[INFO|tokenization_utils_base.py:2599] 2025-12-10 07:32:34,820 >> Special tokens file saved in /output/llama3-qlora-p5000/special_tokens_map.json
***** train metrics *****
epoch = 3.0
total_flos = 31421410GF
train_loss = 1.0571
train_runtime = 1:47:46.01
train_samples_per_second = 0.464
train_steps_per_second = 0.029
Figure saved at: /output/llama3-qlora-p5000/training_loss.png
[WARNING|2025-12-10 07:32:35] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-12-10 07:32:35] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:456] 2025-12-10 07:32:35,317 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
5. Chat with the LoRA Adapter Directly, Without Merging
CLI chat:
llamafactory-cli chat \
--model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
--adapter_name_or_path /output/llama3-qlora-p5000 \
--template llama3 \
--finetuning_type lora \
--quantization_bit 4
Alternatively, use the Web UI: llamafactory-cli webui
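For reference, loading the adapter without merging is roughly equivalent to the following transformers + peft snippet (a sketch of the idea, not LLaMA-Factory's actual internals):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base, adapter = "/models/llama3/Meta-Llama-3-8B-Instruct", "/output/llama3-qlora-p5000"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA weights

messages = [{"role": "user", "content": "Hello!"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))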
6. Export a Complete, Deployable Model (Optional)
llamafactory-cli export \
--model_name_or_path /models/llama3/Meta-Llama-3-8B-Instruct \
--adapter_name_or_path /output/llama3-qlora-p5000 \
--template llama3 \
--finetuning_type lora \
--export_dir /output/merged-llama3-8b-instruct-cn \
--export_size 2 \
--export_device cpu
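What the export step does, in essence, is fold the LoRA deltas back into the base weights. With peft this corresponds to something like the sketch below; merging on CPU in full precision mirrors --export_device cpu, and max_shard_size="2GB" mirrors --export_size 2:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base, adapter = "/models/llama3/Meta-Llama-3-8B-Instruct", "/output/llama3-qlora-p5000"
out_dir = "/output/merged-llama3-8b-instruct-cn"

model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)  # stays on CPU
model = PeftModel.from_pretrained(model, adapter)
merged = model.merge_and_unload()               # W <- W + B @ A, drops the adapter layers
merged.save_pretrained(out_dir, max_shard_size="2GB")
AutoTokenizer.from_pretrained(base).save_pretrained(out_dir)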
After exporting, chat with the command below (the original base model is no longer needed):
llamafactory-cli chat \
--model_name_or_path /output/merged-llama3-8b-instruct-cn \
--template llama3 \
--quantization_bit 4
