Self-Evaluation Guided Beam Search for Reasoning

About

Breaking down a problem into intermediate steps has demonstrated impressive performance in Large Language Model (LLM) reasoning. However, the growth of the reasoning chain introduces uncertainty and error accumulation, making it challenging to elicit accurate final results. To tackle this challenge of uncertainty in multi-step reasoning, we introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of LLMs. We propose a decoding algorithm that integrates the self-evaluation guidance via stochastic beam search. The self-evaluation guidance serves as a better-calibrated automatic criterion, facilitating an efficient search in the reasoning space and resulting in superior prediction quality. Stochastic beam search balances exploitation and exploration of the search space with temperature-controlled randomness. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experimental results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method, which outperforms the baseline methods under comparable computational budgets. Further analysis of multi-step reasoning finds that our self-evaluation guidance pinpoints logic failures and leads to higher consistency and robustness. Our code is publicly available at https://guideddecoding.github.io/.

Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, Xu Zhao, Min-Yen Kan, Junxian He, Qizhe Xie • 2023
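
The decoding idea in the abstract can be illustrated with a small sketch. This is not the authors' released code: the `propose` callback, the weighting parameter `lam`, and the geometric combination of language-model probability and self-evaluation confidence are assumptions made here for illustration. The sketch scores each candidate step, accumulates the score along the chain, and samples the next beam without replacement with weights proportional to exp(score / temperature), so a low temperature approaches ordinary beam search and a high temperature explores more.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Chain = Tuple[str, ...]
# Hypothetical callback: given a partial chain, return candidate next steps as
# (step_text, lm_logprob, self_eval_confidence in (0, 1]).
Proposer = Callable[[Chain], Sequence[Tuple[str, float, float]]]


def combined_score(lm_logprob: float, eval_conf: float, lam: float = 0.5) -> float:
    """Log of P_LM^lam * C^(1 - lam); this weighting is an assumption of the sketch."""
    return lam * lm_logprob + (1.0 - lam) * math.log(max(eval_conf, 1e-12))


def sample_without_replacement(scores: Sequence[float], k: int,
                               temperature: float, rng: random.Random) -> List[int]:
    """Draw up to k distinct indices with weights proportional to exp(score / temperature)."""
    remaining = list(range(len(scores)))
    chosen: List[int] = []
    while remaining and len(chosen) < k:
        top = max(scores[i] for i in remaining)  # subtract the max for numerical stability
        weights = [math.exp((scores[i] - top) / temperature) for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def stochastic_beam_search(propose: Proposer, beam_size: int, max_steps: int,
                           temperature: float = 0.5, lam: float = 0.5,
                           seed: int = 0) -> List[Tuple[Chain, float]]:
    """Return the final beams as (chain, accumulated_score), best first."""
    rng = random.Random(seed)
    beams: List[Tuple[Chain, float]] = [((), 0.0)]
    for _ in range(max_steps):
        candidates: List[Tuple[Chain, float]] = []
        for chain, score in beams:
            for step, lm_logprob, eval_conf in propose(chain):
                candidates.append(
                    (chain + (step,), score + combined_score(lm_logprob, eval_conf, lam))
                )
        if not candidates:  # no chain can be extended further
            break
        keep = sample_without_replacement(
            [s for _, s in candidates], beam_size, temperature, rng
        )
        beams = [candidates[i] for i in keep]
    return sorted(beams, key=lambda b: b[1], reverse=True)
```

In a real setting, `propose` would call an LLM twice per step: once to generate candidate steps with their log-probabilities, and once to elicit a correctness confidence for each step. Both calls are left abstract here.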

Related benchmarks

Task                    Dataset         Metric      Result  Rank
Mathematical Reasoning  MATH 500        Accuracy    76.6    442
Mathematical Reasoning  SVAMP           Accuracy    81.7    403
Mathematical Reasoning  GSM8K           Accuracy    75.51   358
Mathematical Reasoning  AIME 2025       Accuracy    22.67   227
Mathematical Reasoning  AMC             Accuracy    55.5    221
Mathematical Reasoning  GSM-Hard        Solve Rate  32.45   162
Mathematical Reasoning  AIME24          Accuracy    25      160
Reasoning               ARC             Accuracy    81.74   94
Reasoning               ARC Challenge   Accuracy    78.34   93
Logical reasoning       ReClor (test)   Accuracy    60.4    87
Showing 10 of 34 rows
