LLM Fine-tuning Study Notes (1): trl + peft

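1 Basic Information

Related tools:

peft: a Python library for fine-tuning large models
- Documentation: https://huggingface.co/docs/peft
- GitHub: https://github.com/huggingface/peft
transformers: a Python library for obtaining and using pretrained models from the open-source community
- Documentation: https://huggingface.co/transformers
- GitHub: https://github.com/huggingface/transformers
trl: a Python library for training or fine-tuning models with reinforcement learning algorithms
- Documentation: https://huggingface.co/docs/trl/
- GitHub: https://github.com/lvwerra/trl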

The three basic steps of training an LLM with RLHF:

Candidate base models (as of March 9, 2023):

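On the relationship between model size and GPU memory

A base model with more than 10 billion parameters is recommended; running such a model in full precision generally requires more than 40 GB of GPU memory.
Loading a model on the GPU in full precision (FP32) takes about 4 GB of memory per billion parameters; loading it in half precision (FP16) takes half as much.
- More on quantization and memory optimization: https://huggingface.co/blog/hf-bitsandbytes-integration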
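
The rule of thumb above can be checked with a few lines of arithmetic. This is only a rough sketch: the helper name `weights_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and only the model weights are counted (activations, gradients, and optimizer state would add more on top).

```python
# Bytes needed to store a single parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weights_memory_gb(num_params: float, dtype: str = "fp32") -> float:
    """Approximate memory in GB (1 GB = 1e9 bytes) needed to hold the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Hypothetical 10-billion-parameter model at three precisions.
    for dtype in ("fp32", "fp16", "int8"):
        print(f"10B parameters in {dtype}: {weights_memory_gb(10e9, dtype):.0f} GB")
```

For a 10-billion-parameter model this reproduces the figure quoted above for full precision, and shows why lower-precision loading matters on a single GPU.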

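On model/data parallelism and distributed training:

Understanding RLHF:

In RLHF, the Actor Model (the generative model) needs instruction tuning to learn how to follow instructions, while the Reward Model learns human preferences and scores the Actor Model's outputs. The Reward Model can therefore be thought of as a classifier over the Actor Model's outputs.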

Understanding PPO:

Overview of the PPO training setup in TRL:
The active model is the model being trained, and a frozen copy of it is kept as the reference model. As the policy changes, the reference model's outputs serve as a baseline for judging how far the active model has drifted from its original behavior, so that changes which stray too far can be penalized.
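
One common way the reference model acts as a baseline is through a KL-style penalty on the reward. The sketch below follows my understanding of that idea, not TRL's actual code: the function name, argument names, and the `kl_coef` value are all illustrative assumptions. Each token is penalized by the scaled log-probability gap between the two models, and the Reward Model's score is added on the last token.

```python
from typing import List


def kl_penalized_rewards(
    logprobs_active: List[float],
    logprobs_ref: List[float],
    score: float,
    kl_coef: float = 0.2,  # illustrative value, not a recommendation
) -> List[float]:
    """Per-token rewards: a KL-style penalty on every token, plus the
    reward-model score added on the final token."""
    if len(logprobs_active) != len(logprobs_ref):
        raise ValueError("both models must score the same tokens")
    if not logprobs_active:
        return []
    # Log-ratio between active and reference model, scaled by -kl_coef.
    rewards = [-kl_coef * (a - r) for a, r in zip(logprobs_active, logprobs_ref)]
    # The scalar score for the whole response lands on the last token.
    rewards[-1] += score
    return rewards


if __name__ == "__main__":
    # Hypothetical log-probabilities for a three-token response.
    print(kl_penalized_rewards([-0.5, -1.0, -0.2], [-0.7, -1.0, -0.9], score=1.0))
```

When the two models agree on every token the penalty vanishes and only the score remains, which is why the reference model works as a "no change" baseline.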

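Key techniques for doing RLHF on a single GPU: adapters and 8-bit matrix multiplication

8-bit matrix multiplication
- An optimized matrix multiplication algorithm that reduces the GPU memory consumed by the feed-forward and attention computations in a Transformer
  - With 8-bit matrix multiplication, the memory the model consumes can drop to roughly a quarter of what full precision (FP32) needs
- Paper: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (arxiv.org)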
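
To give a feel for where the memory saving comes from, here is a deliberately simplified sketch of absmax quantization of one row of weights to int8. It is not the full LLM.int8() algorithm (which, among other things, handles outlier features separately in higher precision); the function names are hypothetical and plain Python lists stand in for tensors.

```python
from typing import List, Tuple


def quantize_absmax(row: List[float]) -> Tuple[List[int], float]:
    """Map a row of floats to integer codes in [-127, 127] plus one float scale."""
    absmax = max(abs(x) for x in row)
    # Guard against an all-zero row, where any scale works.
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(x / scale) for x in row], scale


def dequantize(q_row: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in q_row]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.03, 1.0]  # made-up values
    codes, scale = quantize_absmax(weights)
    print(codes, scale)
    print(dequantize(codes, scale))
```

Each weight is stored in one byte instead of four, at the cost of a small rounding error bounded by half the scale.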
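Adapters
- Adapters, or more specifically LoRA, are a fine-tuning method for LLMs. The core idea is to leave the pretrained weight matrices frozen and represent the changes to them with low-rank matrices learned during fine-tuning, which reduces the compute resources that fine-tuning requires
- Paper: LoRA: Low-Rank Adaptation of Large Language Models (arxiv.org)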
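
The low-rank idea can be sketched in plain Python. This is a toy illustration under my own assumptions, not the peft implementation: `W` is a frozen d x k weight, `A` (r x k) and `B` (d x r) are the small matrices that would be trained, and the helper names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(
    w: Matrix, a: Matrix, b: Matrix, x: List[float], alpha: float, r: int
) -> List[float]:
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + (alpha / r) * u for h, u in zip(base, update)]


def trainable_params(d: int, k: int, r: int) -> Tuple[int, int]:
    """Trainable parameter counts for a d x k weight: (full fine-tuning, LoRA)."""
    return d * k, r * (d + k)


if __name__ == "__main__":
    full, lora = trainable_params(4096, 4096, 8)
    print(f"full: {full}, LoRA: {lora}, ratio: {full / lora:.0f}x")
```

Because r is much smaller than d and k, the number of trainable parameters shrinks dramatically, which is where the savings during fine-tuning come from.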
Note that both 8-bit (int8) training and Low-Rank Adaptation have ready-made implementations in the Parameter-Efficient Fine-Tuning (PEFT) package.