Continued Pre-Training (CPT) with llama-factory: Training Configuration

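Qwen3 came out a while ago, and the company's new GPU allocation happened to come through at the same time, so I decided to retrain on the new base model and go for a big jump in performance, hehe.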

1. Path Dependence

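The previous version of the model was trained on Qwen2.5-Coder:3b, and the server only had 2 × A100 80G, so when training with llamafactory I never thought about parameter parallelism or anything like it and simply used the default training config from the template: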

bf16: true
cutoff_len: 2048
dataset: afsim_train_data
dataset_dir: data
ddp_timeout: 180000000
do_train: true
finetuning_type: full
flash_attn: auto
gradient_accumulation_steps: 8
learning_rate: 5.0e-05
logging_steps: 5
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 100000
model_name_or_path: /data/wzr/LLM-MODELS/Qwen/Qwen2___5-Coder-3B-Instruct/
num_train_epochs: 3.0
optim: adamw_torch
output_dir: saves/Qwen2.5-Coder-3B-Instruct/full/train_2025-01-07-21-03-26
packing: true
per_device_train_batch_size: 2
plot_loss: true
preprocessing_num_workers: 16
report_to: none
save_steps: 100
stage: pt
template: qwen
trust_remote_code: true
warmup_steps: 0

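At the time, both A100 cards looked fully utilized, and the official docs also say:

If CUDA_VISIBLE_DEVICES is not specified, all GPUs are used by default.

Somehow I took this to mean that training was using parameter parallelism or tensor parallelism (the subconscious is a scary thing). That belief held until I got the new server (8 × A800 80G) and found that even qwen3-14b ran out of memory (OOM).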
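The config above never enables any sharding, so, as far as I understand llamafactory's launcher, a multi-GPU run is plain data parallelism: every GPU holds a full replica of the model and only the batch is split. A minimal sketch of what extra GPUs actually buy in that mode, using the batch settings from the config (the function name is made up for illustration):

```python
def tokens_per_optimizer_step(per_device_batch, grad_accum, num_gpus, cutoff_len):
    """Upper bound on tokens consumed per optimizer step under data parallelism.

    With packing enabled, each sequence is filled up to cutoff_len tokens,
    so this is close to the real number rather than a loose bound.
    """
    sequences = per_device_batch * grad_accum * num_gpus
    return sequences * cutoff_len


# Values taken from the config: per_device_train_batch_size=2,
# gradient_accumulation_steps=8, cutoff_len=2048.
two_a100 = tokens_per_optimizer_step(2, 8, num_gpus=2, cutoff_len=2048)
eight_a800 = tokens_per_optimizer_step(2, 8, num_gpus=8, cutoff_len=2048)

print(two_a100)    # 65536
print(eight_a800)  # 262144
```

Going from 2 to 8 GPUs makes each step four times larger, but the memory needed on each individual card stays the same, which is why a model that does not fit on one card still does not fit on eight.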

I first looked up the theoretical VRAM consumption:

Total VRAM = Model Parameters + Gradients + Optimizer States + Activations

Method	8B	14B	30B
Full (pure_bf16)	60GB	120GB	300GB

(Note: the values in the table are estimates; actual usage during pre-training is somewhat higher.)
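A back-of-the-envelope version of this estimate, as a rough sketch. It assumes everything is kept in bf16 (2 bytes per value): the weights, the gradients, and each of Adam's two moment buffers, for 8 bytes per parameter in total. Activations are left out because they depend on batch size and cutoff length. The byte counts are my assumptions, not numbers taken from the llamafactory docs.

```python
GB = 1e9  # decimal gigabytes, good enough for an order-of-magnitude estimate


def full_finetune_state_gb(num_params, weight_bytes=2, grad_bytes=2, optim_bytes=4):
    """Weights + gradients + Adam moments in GB; activations are NOT included.

    optim_bytes=4 assumes two bf16 moment buffers (2 bytes each) per parameter.
    """
    bytes_per_param = weight_bytes + grad_bytes + optim_bytes
    return num_params * bytes_per_param / GB


for name, num_params in [("8B", 8e9), ("14B", 14e9), ("30B", 30e9)]:
    print(f"{name}: ~{full_finetune_state_gb(num_params):.0f} GB before activations")
```

Under these assumptions the results land in the same range as the table, which is all the estimate is meant to show.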

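Setting aside differences in optimizer, batch size, cutoff length and so on, the order of magnitude should be about what the table shows. An 8-card server (8 × 80 GB = 640 GB in total) should not fail to run even a 14b model.
At first I suspected a llamafactory bug. qwen3 had only just been released, so maybe support was not complete yet, or some underlying dependency had a problem. I searched the issues and nobody seemed to have reported anything like this, and given how popular qwen3 is in the community, an obvious bug would not have gone unnoticed. By this point I had a growing suspicion that the problem was in my training configuration.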

2. Back to the Docs

Sure enough, the "Distributed Training - DeepSpeed" section of the official llamafactory docs spells it out:

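DeepSpeed is an open-source deep learning optimization library developed by Microsoft, designed to improve the efficiency and speed of large-model training. Before using DeepSpeed, you need to estimate the VRAM required by the training job and then choose a suitable ZeRO stage based on the task's needs and the available resources.
ZeRO-1: partitions only the optimizer states; each GPU keeps a full copy of the model parameters and gradients.
ZeRO-2: partitions the optimizer states and gradients; each GPU keeps a full copy of the model parameters.
ZeRO-3: partitions the optimizer states, gradients, and model parameters.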

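In other words, unless DeepSpeed is explicitly configured with ZeRO-3, the 14b model's weights, gradients and so on are replicated in full on every card. At 2 bytes per parameter the weights alone are roughly 28 GB, and the gradients take about as much again, so a large share of each 80 GB card is gone before optimizer states and activations are even counted.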
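To make the difference between the stages concrete, here is a rough sketch of the per-GPU memory for model state under each ZeRO stage. It reuses the same assumption as before (bf16 weights and gradients at 2 bytes each, 4 bytes of Adam state per parameter), ignores activations, buffers and fragmentation, and treats stage 0 as plain data parallelism. The function name is hypothetical.

```python
def per_gpu_state_gb(num_params, num_gpus, zero_stage,
                     weight_bytes=2, grad_bytes=2, optim_bytes=4):
    """Approximate per-GPU GB for weights + gradients + optimizer states."""
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes

    # Each stage partitions one more component across the GPUs.
    if zero_stage >= 1:
        optim /= num_gpus    # ZeRO-1: optimizer states are sharded
    if zero_stage >= 2:
        grads /= num_gpus    # ZeRO-2: gradients are sharded as well
    if zero_stage >= 3:
        weights /= num_gpus  # ZeRO-3: model parameters are sharded too

    return (weights + grads + optim) / 1e9


for stage in range(4):
    gb = per_gpu_state_gb(14e9, num_gpus=8, zero_stage=stage)
    print(f"stage {stage}: ~{gb:.1f} GB per GPU for a 14B model on 8 GPUs")
```

With these assumptions, only the unsharded case exceeds a single 80 GB card for a 14B model, and ZeRO-3 leaves by far the most headroom for activations.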

In the end I launched training with the following arguments:

llamafactory-cli train \
    --stage pt \
    --do_train True \
    --model_name_or_path /root/private_data/SothisAI/model/Aihub/Qwen3-30B-A3B/main/Qwen3-30B-A3B \
    --preprocessing_num_workers 16 \
    --finetuning_type full \
    --template qwen3 \
    --flash_attn auto \
    --dataset_dir data \
    --dataset afsim_train_data \
    --cutoff_len 2048 \
    --learning_rate 5e-05 \
    --num_train_epochs 5.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing True \
    --enable_thinking True \
    --report_to none \
    --freeze_vision_tower True \
    --freeze_multi_modal_projector True \
    --image_max_pixels 589824 \
    --image_min_pixels 1024 \
    --video_max_pixels 65536 \
    --video_min_pixels 256 \
    --output_dir saves/Qwen3-30B-A3B/full/train_deepspeed_z3_2025-05-25-17-02-33 \
    --bf16 True \
    --plot_loss True \
    --trust_remote_code True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --optim adamw_torch \
    --deepspeed cache/ds_z3_config.json

Even the 30b model now runs rock solid, using roughly 70 GB of VRAM per card.
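The command passes --deepspeed cache/ds_z3_config.json, but the contents of that file are not shown here. As a sketch of what a minimal ZeRO-3 config of this kind typically looks like, the snippet below builds one in Python and prints it as JSON. The keys are standard DeepSpeed options, and the "auto" values are meant to be filled in by the Hugging Face Trainer integration from the training arguments. The exact set of options is my assumption and may differ from the file llamafactory generates.

```python
import json


def build_ds_z3_config():
    """Return a minimal DeepSpeed ZeRO-3 config as a dict (assumed layout)."""
    return {
        # "auto" lets the HF Trainer integration copy these from the CLI arguments.
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "zero_allow_untested_optimizer": True,
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,  # shard optimizer states, gradients AND parameters
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            # Gather the sharded weights into one 16-bit checkpoint when saving.
            "stage3_gather_16bit_weights_on_model_save": True,
        },
    }


config_text = json.dumps(build_ds_z3_config(), indent=2)
print(config_text)  # save this output as cache/ds_z3_config.json
```

The line that matters for the OOM problem is "stage": 3; the other settings mostly affect communication and checkpointing behaviour.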

posted @ 2025-05-25 19:49 zion03
