
Reinforcing General Reasoning without Verifiers

About

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
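To make the contrast concrete, here is a minimal sketch (not the paper's implementation; all function names and numbers are illustrative) of the core idea: a verifier-based method assigns a 0/1 reward by mechanically checking the sampled answer, whereas a VeriFree-style reward scores a sampled reasoning trace by the probability the policy itself assigns to the reference answer given the prompt and that trace.

```python
import math

def verifree_reward(answer_token_logprobs):
    """VeriFree-style reward (sketch): no external verifier is called.
    `answer_token_logprobs` stands in for the per-token log-probabilities
    that the policy assigns to the reference-answer tokens, conditioned on
    the prompt and the sampled reasoning trace. The reward is the
    probability of the full reference answer."""
    return math.exp(sum(answer_token_logprobs))

def verifier_based_reward(sampled_answer, reference_answer):
    """Verifier-based baseline (sketch): a rule-based 0/1 check, which only
    works when answers can be verified mechanically."""
    return 1.0 if sampled_answer.strip() == reference_answer.strip() else 0.0

# Toy comparison: two reasoning traces for the same question.
# Trace A makes the reference answer likely; trace B does not.
trace_a_logprobs = [-0.1, -0.2, -0.1]  # hypothetical values
trace_b_logprobs = [-2.0, -3.0, -1.5]  # hypothetical values

r_a = verifree_reward(trace_a_logprobs)
r_b = verifree_reward(trace_b_logprobs)
assert r_a > r_b  # the trace that better supports the answer earns more reward
```

Because the reward is a (differentiable-in-spirit) probability rather than a binary verdict, it extends to domains such as law or biology where no rule-based checker exists, and it removes the need to keep a separate verifier LLM in memory during training.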

Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du • 2025

Related benchmarks

Task                 | Dataset      | Metric              | Result | Rank
---------------------|--------------|---------------------|--------|-----
Logic reasoning      | ZebraLogic   | Score               | 7.6    | 42
Reasoning            | BBH (test)   | Accuracy            | 43.3   | 40
Scientific Reasoning | GPQA Diamond | Pass@1              | 0.444  | 32
Coding               | HumanEval    | HumanEval Mean Score| 0.72   | 28
General Reasoning    | MMLU-Pro     | pass@1 Accuracy     | 44.1   | 27
Knowledge Reasoning  | MMLU-Pro     | --                  | --     | 27
Scientific Reasoning | SuperGPQA    | Mean@1              | 38     | 24
Code                 | HumanEval+   | Accuracy            | 65.9   | 22
Writing              | WritingBench | Score               | 69.5   | 20
Scientific Reasoning | GPQA General | Pass@1              | 19.3   | 17

(Showing 10 of 31 rows.)
