LLM | Trying out verl

Main references:

verl on GitHub: https://github.com/volcengine/verl
verl docs: verl documentation
Installing verl: verl documentation | Installation
Official quick start: verl documentation | Quickstart: PPO training on GSM8K dataset

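A brief record of how I (currently) set up verl:

1 Setting up verl

I reused the environment I had previously set up for LLaMA Factory and then simply followed the official installation guide.

The official docs require python >= 3.10 and CUDA >= 12.8. My CUDA version is 12.2, and it worked anyway.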
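To see up front which of the documented minimums an environment misses, a small check like the one below can help. It is only a minimal sketch: the thresholds are the ones quoted above, `parse_version` and `meets_minimum` are hypothetical helper names, and the CUDA version is typed in by hand rather than detected.

```python
import sys


def parse_version(text):
    """Turn a dotted version string such as '12.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum):
    """Return True if version string `found` is at least `minimum`."""
    # Compare as integer tuples so that '3.9' sorts below '3.10'.
    return parse_version(found) >= parse_version(minimum)


python_found = f"{sys.version_info.major}.{sys.version_info.minor}"
cuda_found = "12.2"  # assumption: entered by hand (my version), not auto-detected

print("python >= 3.10:", meets_minimum(python_found, "3.10"))
print("CUDA >= 12.8:", meets_minimum(cuda_found, "12.8"))
```

With PyTorch installed, `torch.version.cuda` reports the CUDA version PyTorch was built against, which could replace the hand-typed value.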

I did not use Docker; I installed everything directly with pip. The main commands were:

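# 1. Install the dependencies for the FSDP backend
# (the script lives in the verl repo's scripts/ directory, so run it from inside the clone made in step 2)
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh
# To also install the Megatron-LM backend, run: bash scripts/install_vllm_sglang_mcore.sh
# (I don't really understand what "backend" means here)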

# 2. Install verl
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .

# 3. Install flash attention
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3%2Bcu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
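# A plain `pip install flash-attn` failed for me, so I installed from a prebuilt wheel
# This wheel targets Python 3.11 (cp311), PyTorch 2.8, and CUDA 12.x (cu12; mine is 12.2). For other environments, look up the matching wheel here:
# Find Your Compatible Flash Attention Wheel - https://flashattn.dev/#finder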
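The wheel filename itself encodes the environment it was built for, so it is easy to check whether a given URL matches yours. The sketch below parses the URL used above. The regular expression is my own reading of that one filename, not an official naming specification, and `describe_wheel` is a hypothetical helper.

```python
import re
from urllib.parse import unquote

WHEEL_URL = (
    "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/"
    "flash_attn-2.8.3%2Bcu12torch2.8cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
)

# Assumption: pattern inferred from the single filename above.
WHEEL_PATTERN = re.compile(
    r"flash_attn-(?P<flash_attn>[\d.]+)\+cu(?P<cuda>\d+)"
    r"torch(?P<torch>[\d.]+)cxx11abi(?P<cxx11abi>TRUE|FALSE)"
    r"-cp(?P<python>\d+)-cp\d+-(?P<platform>.+)\.whl"
)


def describe_wheel(url):
    """Extract the build tags from a flash-attn wheel URL or filename."""
    filename = unquote(url.rsplit("/", 1)[-1])  # '%2B' decodes to '+'
    match = WHEEL_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    return match.groupdict()


print(describe_wheel(WHEEL_URL))
```

Here `python` comes back as `311` (CPython 3.11), `cuda` as `12` (a major version only) and `torch` as `2.8`.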

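2 Running the official quick start: PPO fine-tuning a 0.5B model on GSM8K

Official quick start: verl documentation | Quickstart: PPO training on GSM8K dataset

This needs a GPU with roughly 20 to 30 GB of memory.

The main commands were:

# 1. Download the GSM8K dataset (the script saves it as parquet files)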
python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/<data_path>/gsm8k

Next, let's look at the contents of the GSM8K dataset:

# Run the following Python script; replace <user_name> and <data_path> first
from datasets import load_dataset

dataset_path = '/home/<user_name>/<data_path>/gsm8k/'
dataset = load_dataset('parquet', data_files={
    'train': f'{dataset_path}train.parquet',
    'test': f'{dataset_path}test.parquet'
})

print(dataset['train'][0])  # show the first record
print(dataset['train'].features)  # show the feature schema

Two sample records:

1
Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer after "####".
'reward_model': {'ground_truth': '72', 'style': 'rule'}

2
Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Let's think step by step and output the final answer after "####".
'reward_model': {'ground_truth': '10', 'style': 'rule'}

The full record format:

{
    "data_source": "openai/gsm8k",
    "prompt": [{
        "content": "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Let's think step by step and output the final answer after \"####\".",
        "role": "user"
    }],
    "ability": "math",
    "reward_model": {"ground_truth": "10", "style": "rule"},
    "extra_info": {
        "answer": "Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\nWorking 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n#### 10",
        "index": 1,
        "question": "Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
        "split": "train"
    }
}
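The prompt asks the model to put its final answer after "####", and `reward_model` carries a `ground_truth` with `style` set to `rule`, which suggests a rule-based comparison. The sketch below shows what such a rule could look like. It is my own illustration, not verl's actual scoring code: `extract_final_answer` and `rule_reward` are hypothetical names, and stripping commas is my own normalization choice.

```python
MARKER = "####"


def extract_final_answer(text):
    """Return the text after the last '####' marker, or None if it is missing."""
    if MARKER not in text:
        return None
    answer = text.rsplit(MARKER, 1)[1].strip()
    return answer.replace(",", "")  # assumption: treat '1,000' and '1000' as equal


def rule_reward(response, ground_truth):
    """Give 1.0 when the extracted answer equals the ground truth, else 0.0."""
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0


reference = (
    "Weng earns 12/60 = $<<12/60=0.2>>0.2 per minute.\n"
    "Working 50 minutes, she earned 0.2 x 50 = $<<0.2*50=10>>10.\n#### 10"
)
print(extract_final_answer(reference))
print(rule_reward(reference, "10"))
print(rule_reward("She earned about ten dollars.", "10"))
```

A response that never writes the marker gets no reward under this rule, even if its reasoning is right, which is why the prompt states the output format explicitly.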

Now let's run PPO training:

# 2. Run PPO training directly; replace <data_path> and the CUDA_VISIBLE_DEVICES value
CUDA_VISIBLE_DEVICES=1 PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 data.train_files=$HOME/<data_path>/gsm8k/train.parquet \
 data.val_files=$HOME/<data_path>/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.max_prompt_length=512 \
 data.max_response_length=512 \
 actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.name=vllm \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=console \
 trainer.val_before_train=False \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log
# `2>&1 | tee verl_demo.log` prints the output (stdout and stderr) to the terminal and also saves it to ./verl_demo.log

Then test the result:

# 3. Merge the training checkpoint into a model (?). Replace global_step_435 with the step count your run actually reached
CUDA_VISIBLE_DEVICES=1 python3 -m verl.model_merger merge \
    --backend fsdp \
    --local_dir checkpoints/verl_examples/gsm8k/global_step_435/actor \
    --target_dir checkpoints/verl_examples/gsm8k/global_step_435/actor/huggingface

# 4. The merge produces safetensors weights, which LLaMA Factory can load directly
CUDA_VISIBLE_DEVICES=1 llamafactory-cli chat \
    --model_name_or_path checkpoints/verl_examples/gsm8k/global_step_435/actor/huggingface \
    --template qwen
# Now you can chat with the trained LLM

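The evaluation metric is val-aux/openai/gsm8k/reward/mean@1. According to the log, it rose from about 0.2 to about 0.47 within the first few dozen training steps, then climbed slowly to about 0.55. The commands above ran for 435 training steps in total.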
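The 435 steps are consistent with the training settings. The arithmetic below is a sketch under two assumptions of mine: that the GSM8K train split has 7473 examples, and that the last incomplete batch of each epoch is dropped.

```python
train_examples = 7473    # assumption: size of the GSM8K train split
train_batch_size = 256   # data.train_batch_size
total_epochs = 15        # trainer.total_epochs

# Assumption: the final partial batch of each epoch is dropped.
steps_per_epoch = train_examples // train_batch_size
total_steps = steps_per_epoch * total_epochs

print(steps_per_epoch, total_steps)
```

This gives 29 steps per epoch and 435 steps overall, matching the `global_step_435` checkpoint directory used in the merge command.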

The training log is in ./verl_demo.log.
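To follow the validation metric without scrolling through the whole log, the values can be pulled out with a small script. This is a rough sketch that assumes the metric appears in the log as `name:value` (or `name=value`); I have not confirmed that against the logger's exact output, so adjust the pattern if your log looks different. `metric_values` and `read_metric` are hypothetical helpers.

```python
import re

METRIC = "val-aux/openai/gsm8k/reward/mean@1"


def metric_values(lines, metric=METRIC):
    """Collect every value logged for `metric`, in order of appearance."""
    # Assumption: the log writes the metric as 'name:value' or 'name=value'.
    pattern = re.compile(re.escape(metric) + r"\s*[:=]\s*(-?\d+(?:\.\d+)?)")
    values = []
    for line in lines:
        match = pattern.search(line)
        if match:
            values.append(float(match.group(1)))
    return values


def read_metric(path="verl_demo.log", metric=METRIC):
    """Read a log file and return the values logged for `metric`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return metric_values(handle, metric)
```

Calling `read_metric()` from the directory that contains verl_demo.log should return one value per validation run, which with `trainer.test_freq=10` means one every 10 steps.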

wandb can also be configured, but I haven't looked into that yet.