
Learning to Reason without External Rewards

About

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving better generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor.
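The core idea in the abstract, scoring each sampled response by the model's own confidence and feeding that score into GRPO's group-relative advantage, can be illustrated with a minimal sketch. The abstract does not define self-certainty, so the formula below (average KL divergence from a uniform distribution to the model's next-token distribution) is an assumption, as are the function names `self_certainty` and `group_relative_advantages` and the use of the population standard deviation. This is not the authors' implementation; see the linked repository for that.

```python
import math
from statistics import mean, pstdev


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def self_certainty(token_logits):
    """ASSUMED definition: mean over token positions of KL(U || p),
    where U is uniform over the vocabulary and p is the model's
    next-token distribution. Higher means more confident."""
    total = 0.0
    for logits in token_logits:
        v = len(logits)
        logp = log_softmax(logits)
        # KL(U || p) = sum_j (1/V) * (log(1/V) - log p_j)
        #            = -log V - mean_j log p_j
        total += -math.log(v) - sum(logp) / v
    return total / len(token_logits)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of
    responses sampled for the same prompt (population std is an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-token logits for three sampled responses to one prompt
# (toy vocabulary of size 3; real models have tens of thousands of entries).
group = [
    [[4.0, 0.0, 0.0], [3.0, 0.5, 0.0]],  # fairly confident response
    [[1.0, 0.5, 0.0], [0.5, 0.5, 0.0]],  # less confident response
    [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]],  # near-uniform, low confidence
]
rewards = [self_certainty(resp) for resp in group]
advantages = group_relative_advantages(rewards)
```

In this sketch the advantages would take the place of the ones GRPO normally computes from a verifier's reward, so no gold solutions or test cases are consulted anywhere in the loop.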

Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH | Accuracy: 37.2 | 882 |
| Instruction Following | IFEval | -- | 836 |
| Instruction Following | AlpacaEval 2.0 | -- | 722 |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc): 58.6 | 543 |
| Instruction Following | AlpacaEval | Win Rate: 40.11 | 420 |
| Mathematical Reasoning | GSM8K | Accuracy: 46.55 | 388 |
| Mathematical Reasoning | AIME 2024 | Accuracy: 20.5 | 370 |
| Mathematical Reasoning | AMC | Accuracy (%): 35.99 | 368 |
| Mathematical Reasoning | MATH | Accuracy: 47.6 | 338 |
| Multi-hop Question Answering | HotpotQA (test) | -- | 311 |

Showing 10 of 105 rows.
