Maximizing Confidence Alone Improves Reasoning

About

Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen, Mistral, and Llama families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is unavailable.

Mihir Prabhudesai, Lili Chen, Alex Ippoliti, Katerina Fragkiadaki, Hao Liu, Deepak Pathak• 2025

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval	--	836
Mathematics	MATH 500	Pass@177.26	122
Mathematical Reasoning	AMC 23	Pass@1 Accuracy65	109
Mathematical Reasoning	MATH500	Accuracy71.4	76
Multi-task Language Understanding	MMLU-Pro	Pass@169.91	64
Mathematical Reasoning	Mathematical Reasoning Suite (AMC, AIME 2024, AIME 2025, Minerva, MATH, Olympiad) various (test val)	Average Score61.7	55
Math Reasoning	GSM8K	Pass@4 Accuracy93.78	54
Mathematics	AMC	pass@150.82	53
Mathematics	AIME 2024	Pass@10.1854	49
Math	GSM8K	Pass@191.2	47

Showing 10 of 37 rows

Other info

Follow for update

@wizwand_team Discord