
Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models

About

Complex reasoning remains a critical bottleneck for the practical application of Large Language Models (LLMs). Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement & Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and its corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the Game of 24, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up over the leading Forest of Thoughts framework.
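The selection stage described above can be illustrated with a minimal sketch. The abstract does not give the authors' exact scoring formula, so the snippet below makes two labeled assumptions: each distinct intermediate step becomes a graph node, and a node's success rate is approximated by its frequency across the generated chains; a chain's cumulative score is the product of its nodes' estimated rates.

```python
from collections import defaultdict

def select_best_chain(chains):
    """Select the most reliable reasoning trajectory.

    `chains` is a list of trajectories, each a list of step strings
    ending in a final answer. Each distinct step is treated as a node
    in a reasoning relation graph; its frequency across chains serves
    as a stand-in for the per-node success-rate estimate (an assumed
    proxy, not the paper's exact formula). A chain's cumulative
    success rate is the product of its nodes' estimated rates.
    """
    counts = defaultdict(int)
    for chain in chains:
        for step in chain:
            counts[step] += 1
    n = len(chains)

    def cumulative(chain):
        score = 1.0
        for step in chain:
            score *= counts[step] / n  # estimated node success rate
        return score

    return max(chains, key=cumulative)

# Three hypothetical trajectories; two agree on the answer.
chains = [
    ["a+b=5", "5*2=10", "answer: 10"],
    ["a+b=5", "5+5=10", "answer: 10"],
    ["a+b=6", "6*2=12", "answer: 12"],
]
best = select_best_chain(chains)
```

Here the first two chains share the step `a+b=5` and the answer `answer: 10`, so their cumulative scores dominate the outlier chain; the selected trajectory therefore ends in `answer: 10`.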

Yujiao Yang, Jing Lian, Linhui Li • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | GSM8K (test) | Accuracy | 96.5 | 797
Multi-hop Question Answering | HotpotQA N=1,000 (test) | F1 Score | 83.5 | 23
Arithmetic Reasoning | Game of 24 95 (test) | Success Rate | 100 | 9
Knowledge-intensive Reasoning | MMLU-CF first 1,000 samples (test) | Exact Match Accuracy | 74.2 | 7
Logical Reasoning | BBH multiple-choice (first 1,000 samples) | Exact Match Accuracy | 86.2 | 7
Mathematical Reasoning | MATH first 1,000 samples (test) | Exact Match Accuracy | 86.8 | 7
Multi-hop Reasoning | LongBench MuSiQue and WikiMultiHopQA | F1 Score | 69.9 | 7
