
Incentivizing LLMs to Self-Verify Their Answers

About

Large Language Models (LLMs) have demonstrated remarkable progress on complex reasoning tasks through both post-training and test-time scaling. While prevalent test-time scaling approaches typically use external reward models to guide the generation process, we find that they yield only marginal gains when scaling a model post-trained on specific reasoning tasks. We identify that this limited improvement stems from distribution discrepancies between the task-specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its own generations, without the need for external verifiers. We train our self-verification models on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models not only improve post-training performance but also enable effective test-time scaling.
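The inference-time procedure described above can be sketched as a best-of-N selection loop: sample several candidate solutions, have the same model score each one's correctness, and keep the highest-scoring candidate. This is a minimal illustrative sketch, not the authors' code; `generate_candidates` and `self_verify` are hypothetical stand-ins for calls to the unified generator/verifier model.

```python
def generate_candidates(question, n):
    # Stand-in for sampling n solutions from the RL-trained model.
    # Here we just produce labeled placeholder strings.
    return [f"solution_{i}" for i in range(n)]

def self_verify(question, solution):
    # Stand-in for the same model judging its own solution,
    # returning a correctness score (e.g., probability of a
    # "correct" verdict). This toy score is deterministic.
    return int(solution.rsplit("_", 1)[1]) / 10.0

def best_of_n(question, n=8):
    # Test-time scaling via self-verification: generate n
    # candidates and return the one the model itself rates highest,
    # with no external reward model involved.
    candidates = generate_candidates(question, n)
    return max(candidates, key=lambda s: self_verify(question, s))

print(best_of_n("What is 2 + 2?", n=4))
```

Because generation and verification share one model, increasing `n` trades extra inference compute for accuracy without requiring a separately trained external verifier.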

Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An • 2025

Related benchmarks

Task                    Dataset    Metric   Result  Rank
Mathematical Reasoning  MATH 500   Pass@1   93.5    153
Mathematical Reasoning  Minerva    Pass@1   41      138
Mathematical Reasoning  AMC        Pass@1   92.5    112
Mathematical Reasoning  AIME 2025  Pass@1   39.45   96
Mathematical Reasoning  AIME 2024  Pass@1   55.8    86
Mathematical Reasoning  AMC23      Avg@16   59.2    36
Mathematical Reasoning  AIME 25    Avg@16   9.8     18
Mathematical Reasoning  Olympiad   Acc@16   40.5    12
Mathematical Reasoning  MATH500    Acc@16   73.5    12
Mathematical Reasoning  Minerva    Acc@16   24.6    12
