Essay archive for February 18, 2025: An introduction to the various RLHF training algorithms ... - xiaoxi666

February 18, 2025

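Summary: Reinforcement learning is being applied to LLMs more and more. This post walks through several common training algorithms, using everyday analogies to help explain the concepts involved. It covers PPO, DRO, DPO, β-DPO, sDPO, RSO, IPO, GPO, KTO, ORPO, SimPO, R-DPO, RLOO, and GRPO. PPO (Proximal Policy Optim ...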

posted @ 2025-02-18 23:18 xiaoxi666 Views (890) Comments (0) Recommendations (1)


