VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism
About
Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
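To make the search procedure concrete, here is a minimal, self-contained MCTS sketch in the spirit of the description above: nodes hold reasoning steps, root-to-leaf paths form reasoning chains, and a self-reward scores each path as a weighted mix of sub-question utility, answer correctness, and clue relevance. The step generator and the three reward components are toy stand-ins (an LVLM would supply real candidate steps and scores); none of the function names below come from the paper.

```python
import math
import random

class Node:
    """One node in the search tree; holds a single reasoning step."""
    def __init__(self, step, parent=None):
        self.step = step
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0  # accumulated reward from rollouts through this node

    def uct(self, c=1.4):
        # Upper Confidence bound for Trees: exploitation + exploration terms.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def self_reward(path, w=(0.4, 0.4, 0.2)):
    # Toy stand-in for the multimodal self-reward: a weighted mix of
    # sub-question utility, answer correctness, and clue relevance.
    # Each component here is derived from synthetic step labels.
    utility = sum(s.endswith("good") for s in path) / len(path)
    correctness = 1.0 if path[-1].endswith("good") else 0.0
    relevance = 0.5  # placeholder score for vision-language clue relevance
    return w[0] * utility + w[1] * correctness + w[2] * relevance

def expand(node, depth):
    # Hypothetical step generator; an LVLM would propose candidate steps here.
    for label in ("good", "bad"):
        node.children.append(Node(f"step{depth}-{label}", parent=node))

def mcts(iterations=200, max_depth=3, seed=0):
    random.seed(seed)
    root = Node("root")
    for _ in range(iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node, depth = root, 0
        while node.children:
            node = max(node.children, key=Node.uct)
            depth += 1
        # 2. Expansion: add candidate next steps, pick one to evaluate.
        if depth < max_depth:
            expand(node, depth + 1)
            node = random.choice(node.children)
        # 3. Evaluation: score the root-to-node path with the self-reward.
        path, n = [], node
        while n.parent is not None:
            path.append(n.step)
            n = n.parent
        reward = self_reward(path[::-1])
        # 4. Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Commit to the most-visited first step (standard final-move rule).
    return max(root.children, key=lambda c: c.visits).step
```

Because the reward prefers paths built from high-quality steps, the search concentrates visits on the stronger branch; swapping the toy components for LVLM-scored ones recovers the training-free setup the abstract describes.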
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista (testmini) | Accuracy | 65.4 | 33 |
| Multimodal Mathematical Reasoning | MathVista mini (test) | Overall Accuracy | 67.4 | 33 |
| Visual Reasoning | CharXiv (val) | Text in Chart Accuracy | 37.95 | 16 |
| Mathematical Reasoning | MATH-Vision mini (test) | ALG | 42.11 | 8 |
| Multimodal Mathematical Reasoning | MATH-Vision (testmini) | Alg Score | 21.05 | 8 |