RLBFF: BINARY FLEXIBLE FEEDBACK TO BRIDGE BETWEEN HUMAN FEEDBACK & VERIFIABLE REWARDS
Language Models that Think, Chat Better