
Reinforcing General Reasoning without Verifiers

About

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
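To make the contrast concrete, here is a minimal sketch (not the paper's implementation; all function names and numbers are illustrative) of the core idea: a verifier-based method assigns a 0/1 reward by mechanically checking the sampled answer, whereas a VeriFree-style reward scores a sampled reasoning trace by the probability the policy itself assigns to the reference answer given the prompt and that trace.

```python
import math

def verifree_reward(answer_token_logprobs):
    """VeriFree-style reward (sketch): no external verifier is called.
    `answer_token_logprobs` stands in for the per-token log-probabilities
    that the policy assigns to the reference-answer tokens, conditioned on
    the prompt and the sampled reasoning trace. The reward is the
    probability of the full reference answer."""
    return math.exp(sum(answer_token_logprobs))

def verifier_based_reward(sampled_answer, reference_answer):
    """Verifier-based baseline (sketch): a rule-based 0/1 check, which only
    works when answers can be verified mechanically."""
    return 1.0 if sampled_answer.strip() == reference_answer.strip() else 0.0

# Toy comparison: two reasoning traces for the same question.
# Trace A makes the reference answer likely; trace B does not.
trace_a_logprobs = [-0.1, -0.2, -0.1]  # hypothetical values
trace_b_logprobs = [-2.0, -3.0, -1.5]  # hypothetical values

r_a = verifree_reward(trace_a_logprobs)
r_b = verifree_reward(trace_b_logprobs)
assert r_a > r_b  # the trace that better supports the answer earns more reward
```

Because the reward is a (differentiable-in-spirit) probability rather than a binary verdict, it extends to domains such as law or biology where no rule-based checker exists, and it removes the need to keep a separate verifier LLM in memory during training.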

Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du • 2025

Related benchmarks

Task                 | Dataset      | Metric              | Result | Rank
---------------------|--------------|---------------------|--------|-----
Logic reasoning      | ZebraLogic   | Score               | 7.6    | 42
Reasoning            | BBH (test)   | Accuracy            | 43.3   | 40
Scientific Reasoning | GPQA Diamond | Pass@1              | 0.444  | 32
Coding               | HumanEval    | HumanEval Mean Score| 0.72   | 28
General Reasoning    | MMLU-Pro     | pass@1 Accuracy     | 44.1   | 27
Knowledge Reasoning  | MMLU-Pro     | --                  | --     | 27
Scientific Reasoning | SuperGPQA    | Mean@1              | 38     | 24
Code                 | HumanEval+   | Accuracy            | 65.9   | 22
Writing              | WritingBench | Score               | 69.5   | 20
Scientific Reasoning | GPQA General | Pass@1              | 19.3   | 17

(Showing 10 of 31 rows.)
