V-STaR: Training Verifiers for Self-Taught Reasoners

About

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large number of incorrect solutions generated during this process, potentially neglecting valuable information in them. To address this shortcoming, we propose V-STaR, which uses both the correct and incorrect solutions generated during the self-improvement process to train, via DPO, a verifier that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidates. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers, delivering a 4% to 17% test accuracy improvement over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal • 2024
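
The two data-handling steps described in the abstract, turning correct and incorrect solutions into verifier training data and then using the verifier to pick one candidate at inference time, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Solution` class, `build_preference_pairs`, and `select_best` are hypothetical names, pairing every correct solution with every incorrect one for the same problem is an assumption about how the preference data might be built, and the verifier is stood in for by an arbitrary scoring function rather than a DPO-trained model.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Solution:
    """A model-generated solution (hypothetical container, not from the paper)."""
    problem: str
    text: str
    is_correct: bool  # e.g. decided by unit tests or final-answer matching


def build_preference_pairs(solutions: List[Solution]) -> List[Tuple[str, str, str]]:
    """Return (problem, preferred, rejected) triples.

    Assumption: each correct solution is paired with each incorrect
    solution for the same problem. Problems lacking either kind of
    solution contribute no pairs.
    """
    by_problem: Dict[str, Tuple[List[str], List[str]]] = {}
    for s in solutions:
        correct, incorrect = by_problem.setdefault(s.problem, ([], []))
        (correct if s.is_correct else incorrect).append(s.text)

    pairs: List[Tuple[str, str, str]] = []
    for problem, (correct, incorrect) in by_problem.items():
        pairs.extend((problem, c, i) for c, i in product(correct, incorrect))
    return pairs


def select_best(
    problem: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> str:
    """Pick the candidate the verifier scores highest (best-of-k selection).

    `score` stands in for a trained verifier; any callable mapping
    (problem, candidate) to a number works here.
    """
    if not candidates:
        raise ValueError("need at least one candidate solution")
    return max(candidates, key=lambda c: score(problem, c))
```

In this sketch the triples from `build_preference_pairs` have the (prompt, preferred, rejected) shape that DPO-style training typically consumes, and `select_best` would be called once per test problem with the sampled candidates and the trained verifier's scoring function.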

Related benchmarks

Task	Dataset	Result
Symbolic Reasoning	Letter	Accuracy 74.67	67
Algorithmic Reasoning	MATH	Accuracy 76.8	46
Reasoning	Bamboogle	Accuracy 63	46
Symbolic Reasoning	COIN	Accuracy 77	45
Code Generation	HumanEval OOD	Pass@1 28.04	39
Mathematical Reasoning	MATH OOD	Accuracy 28.85	38
Domain-specific Reasoning	LegalBench	Accuracy 64.21	33
Mathematical Reasoning	GSM-Hard	Accuracy 48.4	31
Mathematical Reasoning	GSM8K 10 (test)	m1@t128	24
Domain Reasoning	HL	Accuracy 75	23

Showing 10 of 13 rows
