A Minimaximalist Approach to Reinforcement Learning from Human Feedback
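Abstract:
This paper addresses the shortcomings of traditional reinforcement learning from human feedback (RLHF/PbRL) methods by proposing Self-Play Preference Optimization (SPO), a reinforcement-learning-based algorithm. Its core idea can be summarized as ...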
posted @ 2025-08-26 18:18 limingqi