
Reinforcing General Reasoning without Verifiers

About

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
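The core idea above — replacing an external verifier with the policy's own likelihood of the reference answer — can be illustrated with a toy sketch. This is a minimal, hypothetical illustration (a tabular "policy" stands in for an LLM; real training uses token-level log-probabilities from the model, as in the linked repository), not the paper's implementation:

```python
# Toy sketch of the VeriFree objective: maximize the expected probability
# of the reference answer a*, marginalized over sampled reasoning traces r.
# All names and numbers here are illustrative assumptions.

# Hypothetical policy: for a fixed question q, the probability of sampling
# each reasoning trace, and the probability of then emitting the reference
# answer given that trace.
policy = {
    "trace_1": {"p_reasoning": 0.6, "p_ref_answer": 0.9},
    "trace_2": {"p_reasoning": 0.4, "p_ref_answer": 0.2},
}

def verifree_objective(policy):
    """E_{r ~ pi(.|q)}[ pi(a* | q, r) ]: the reward signal is the model's
    own likelihood of the ground-truth answer, so no external verifier
    model is needed during training."""
    return sum(v["p_reasoning"] * v["p_ref_answer"] for v in policy.values())

print(round(verifree_objective(policy), 3))  # 0.6*0.9 + 0.4*0.2 = 0.62
```

In an actual RL loop, this expectation would be estimated by sampling reasoning traces and weighting their policy-gradient updates by the reference-answer probability, which is what lets the same model act as both policy and implicit verifier.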

Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1 | 86 | 239 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy | 22.9 | 165 |
| Mathematical Reasoning | AIME 2025 | Pass@1 Accuracy | 20.2 | 118 |
| General Reasoning | MMLU-Pro | Pass@1 Accuracy | 62.3 | 69 |
| Reasoning | BBH (test) | Accuracy | 43.3 | 67 |
| Writing | WritingBench | Score | 69.5 | 58 |
| Logic Reasoning | ZebraLogic | Score | 7.6 | 42 |
| Knowledge Reasoning | MMLU-Pro | -- | -- | 40 |
| Code | HumanEval+ | Accuracy | 65.9 | 34 |
| Scientific Reasoning | SuperGPQA | Mean@1 | 38 | 34 |
Showing 10 of 38 rows
