Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

About

Existing methods to enhance the reasoning capability of large language models predominantly rely on supervised fine-tuning (SFT) followed by reinforcement learning (RL) on reasoning-specific data. These approaches critically depend on external supervision, such as labeled reasoning traces, verified golden answers, or pre-trained reward models. In this work, we propose Entropy Minimized Policy Optimization (EMPO), which makes an early attempt at fully unsupervised LLM reasoning incentivization. By continuously minimizing the predictive entropy of LLMs on unlabeled questions in a latent semantic space, EMPO achieves competitive performance compared to supervised counterparts on both mathematical and free-form natural reasoning tasks. Specifically, without any supervised signals, EMPO boosts the accuracy of Qwen2.5-Math-7B Base from 30.7% to 48.1% on mathematical benchmarks and improves the accuracy of Qwen2.5-7B Base from 32.1% to 50.1% on MMLU-Pro. Primary experiments and analysis are also provided to interpret the effectiveness of EMPO. Code is available at https://github.com/QingyangZhang/EMPO.
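To make "predictive entropy in a latent semantic space" concrete, here is a toy sketch, not the authors' implementation. It assumes that several answers are sampled for one unlabeled question and that answers are grouped into meaning clusters; the `normalize` helper (whitespace- and case-insensitive string matching) is a hypothetical stand-in for a real semantic clustering step, and `cluster_rewards` is an assumed illustration of how low entropy could be turned into a per-sample signal.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic clustering: two answers share a
    # cluster if they match after lowercasing and collapsing whitespace.
    return " ".join(answer.lower().split())


def semantic_entropy(answers: list[str]) -> float:
    # Shannon entropy (in nats) of the empirical distribution over clusters.
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def cluster_rewards(answers: list[str]) -> list[float]:
    # Assumed reward: each sample is scored by the empirical probability of
    # its own cluster, so agreeing with the majority meaning scores higher.
    keys = [normalize(a) for a in answers]
    counts = Counter(keys)
    return [counts[k] / len(keys) for k in keys]


if __name__ == "__main__":
    samples = ["42", " 42 ", "7", "42"]  # made-up sampled answers
    print(semantic_entropy(samples))
    print(cluster_rewards(samples))
```

When all samples fall into one cluster the entropy is zero; the more the samples disagree in meaning, the higher it gets. No reference answer is needed at any point, which is the sense in which such a signal is unsupervised.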

Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, Yatao Bian • 2025

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	GSM8K (test)	Accuracy	83.2	954
Multi-hop Question Answering	2WikiMultihopQA	EM	35.7	559
Mathematical Reasoning	MATH 500	Accuracy	73	221
GUI Grounding	ScreenSpot Pro	Accuracy	20.7	195
GUI Grounding	ScreenSpot	Avg Acc	69.2	160
GUI Grounding	OSWorld-G	Average Score	42.6	144
Mathematical Reasoning	AIME 24	Pass@1 Accuracy	13.3	128
Mathematical Reasoning	AIME 2024	Accuracy @16	15.8	81
Mathematical Reasoning	AMC 2023	Avg@16 Score	60.2	48
GUI Grounding	MMBench-GUI-L2	Accuracy	58.2	43

Showing 10 of 41 rows
