Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

About

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine an optimal thinking duration, recognize the boundaries of their knowledge, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR (Meta-Awareness via Predictive Reward) uses a self-generated task of predicting rollout statistics (specifically length, pass rate, and concepts used), so that each prediction can be verified against the actual statistics. By leveraging this self-predictive capability, the model can also regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks. Specifically, our method speeds up GRPO training by over 1.28x to reach the same performance, and achieves an 83.18% gain in accuracy on AIME25 and a 13.04% average gain over six mathematics benchmarks. The code is publicly available at https://github.com/akatigre/MAPR-RL.
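The abstract does not spell out how the predictive reward is computed, so the following is only a minimal sketch of the general idea: score a model's predicted length, pass rate, and concepts against the statistics observed over a group of rollouts, then use the prediction to filter prompts and budget generation length. Every name here (`Rollout`, `MetaPrediction`, `predictive_reward`, `keep_prompt`, `length_budget`), the equal weighting of the three terms, the Jaccard overlap for concepts, and all thresholds are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rollout:
    length: int          # number of generated tokens
    correct: bool        # whether the final answer passed verification
    concepts: frozenset  # concepts that appeared in the reasoning


@dataclass
class MetaPrediction:
    length: float        # predicted rollout length
    pass_rate: float     # predicted fraction of correct rollouts, in [0, 1]
    concepts: frozenset  # predicted concepts


def actual_statistics(rollouts):
    """Aggregate observed statistics over a non-empty group of rollouts."""
    correct = [r for r in rollouts if r.correct]
    # Assumption: length and concept targets come from correct rollouts
    # when any exist, otherwise from the whole group.
    pool = correct or rollouts
    mean_length = mean(r.length for r in pool)
    pass_rate = len(correct) / len(rollouts)
    concepts = frozenset().union(*(r.concepts for r in pool))
    return mean_length, pass_rate, concepts


def predictive_reward(pred, rollouts, length_tol=0.25):
    """Average of three agreement terms, each in [0, 1] (hypothetical form)."""
    mean_length, pass_rate, concepts = actual_statistics(rollouts)
    # Length: full credit if the prediction is within a relative tolerance.
    length_r = 1.0 if abs(pred.length - mean_length) <= length_tol * mean_length else 0.0
    # Pass rate: one minus the absolute error.
    pass_r = 1.0 - abs(pred.pass_rate - pass_rate)
    # Concepts: Jaccard overlap between predicted and observed sets.
    union = pred.concepts | concepts
    concept_r = len(pred.concepts & concepts) / len(union) if union else 1.0
    return (length_r + pass_r + concept_r) / 3.0


def keep_prompt(pred, low=0.1, high=0.9):
    """Skip prompts predicted to be unsolvable (below low) or trivial (above high)."""
    return low <= pred.pass_rate <= high


def length_budget(pred, slack=1.5):
    """Token cap derived from the predicted length (hypothetical cutoff rule)."""
    return int(pred.length * slack)
```

In a training loop, a sketch like this would be applied per prompt: `keep_prompt` decides whether the prompt is worth sampling, `length_budget` bounds each generation, and `predictive_reward` is added alongside the usual answer-based reward.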

Yoonjeon Kim, Doohyuk Jang, Eunho Yang • 2025

Related benchmarks

Task	Dataset	Result
Scientific Reasoning	ARC Challenge	--	121
Code Generation	EvalPlus	Pass@1 77.66	118
Scientific Reasoning	GPQA Diamond	Pass@1 Accuracy 53.72	67
Mathematical Reasoning	Olympiad	Pass@1 61.59	41
Mathematical Reasoning	AMC23	Pass@1 Score 86.02	39
Coding	MBPP	--	37
Logical Reasoning	Logical Deduction	Pass@1 81.03	20
Scientific Reasoning	SciBench	Pass@1 29.64	12
Coding	LiveCodeBench	Total Pass Rate 31.61	11
Mathematical Reasoning	AIME 2024	Pass@1 44.27	6

Showing 10 of 17 rows
