
V-STaR: Training Verifiers for Self-Taught Reasoners

About

Common self-improvement approaches for large language models (LLMs), such as STaR, iteratively fine-tune LLMs on self-generated solutions to improve their problem-solving ability. However, these approaches discard the large amounts of incorrect solutions generated during this process, potentially neglecting valuable information in such solutions. To address this shortcoming, we propose V-STaR, which utilizes both the correct and incorrect solutions generated during the self-improvement process to train a verifier using DPO that judges the correctness of model-generated solutions. This verifier is used at inference time to select one solution among many candidate solutions. Running V-STaR for multiple iterations yields progressively better reasoners and verifiers, delivering a 4% to 17% improvement in test accuracy over existing self-improvement and verification approaches on common code generation and math reasoning benchmarks with LLaMA2 models.
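The abstract describes two mechanical pieces: (1) turning the correct and incorrect solutions collected during self-improvement into preference pairs for DPO training of the verifier, and (2) using that verifier at inference time to pick one solution out of many candidates (best-of-k selection). A minimal sketch of both steps, with hypothetical stand-ins for the generator, the ground-truth check, and the verifier (none of these are the paper's actual models or APIs):

```python
import random

# Hedged sketch of the V-STaR data-collection and inference steps.
# `generate`, `is_correct`, and `verifier_score` are illustrative
# stand-ins, not the paper's actual models.

def generate(problem, k, seed=0):
    """Stand-in: sample k candidate solutions from the fine-tuned generator."""
    rng = random.Random(seed)
    return [f"solution-{i} (score={rng.random():.3f})" for i in range(k)]

def is_correct(problem, solution):
    """Stand-in for the ground-truth check (e.g. test cases for code,
    final-answer matching for math)."""
    return float(solution.split("score=")[1].rstrip(")")) > 0.5

def build_dpo_pairs(problem, candidates):
    """Pair each correct solution (preferred) with each incorrect one
    (dispreferred) to form DPO training data for the verifier."""
    correct = [s for s in candidates if is_correct(problem, s)]
    incorrect = [s for s in candidates if not is_correct(problem, s)]
    return [(problem, win, lose) for win in correct for lose in incorrect]

def verifier_score(problem, solution):
    """Stand-in for the DPO-trained verifier's correctness score."""
    return float(solution.split("score=")[1].rstrip(")"))

def best_of_k(problem, k=8):
    """Inference: sample k candidates and return the verifier's top pick."""
    candidates = generate(problem, k)
    return max(candidates, key=lambda s: verifier_score(problem, s))
```

The key design point the abstract emphasizes is that incorrect solutions are not discarded: they supply the dispreferred side of every DPO pair, which is what makes the verifier trainable without extra labeled data.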

Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Symbolic Reasoning | Letter | Accuracy | 74.67 | 67 |
| Algorithmic Reasoning | MATH | Accuracy | 76.8 | 46 |
| Reasoning | Bamboogle | Accuracy | 63 | 46 |
| Symbolic Reasoning | COIN | Accuracy | 77 | 45 |
| Domain-specific Reasoning | LegalBench | Accuracy | 64.21 | 33 |
| Mathematical Reasoning | MATH OOD | Accuracy | 28.85 | 30 |
| Code Generation | HumanEval OOD | Pass@1 | 28.04 | 30 |
| Mathematical Reasoning | GSM-Hard | Accuracy | 48.4 | 28 |
| Mathematical Reasoning | GSM8K 10 (test) | m1@t1 | 28 | 24 |
| Domain Reasoning | HL | Accuracy | 75 | 23 |
Showing 10 of 13 rows
