
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism

About

Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
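To make the idea concrete, here is a minimal, runnable sketch of Monte Carlo Tree Search over reasoning steps evaluated by a self-reward function, in the spirit of the method described above. The helpers `propose_steps` and `self_reward` are toy stand-ins (in VReST these would be LVLM calls producing sub-questions and scoring answer correctness and vision-language clue relevance); this is an illustration of the search skeleton, not the paper's implementation.

```python
import math
import random

random.seed(0)

def propose_steps(path):
    """Toy stand-in: propose candidate next reasoning steps.
    A real system would sample continuations from the LVLM."""
    return [path + (c,) for c in "ab"]

def self_reward(path):
    """Toy stand-in: score a reasoning path in [0, 1].
    The paper's self-reward combines sub-question utility, answer
    correctness, and vision-language clue relevance; here we simply
    favour paths containing more 'a' steps."""
    return sum(1 for s in path if s == "a") / max(len(path), 1)

class Node:
    def __init__(self, path, parent=None):
        self.path = path          # tuple of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    """Upper Confidence Bound used during selection."""
    if node.visits == 0:
        return float("inf")
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts(iterations=50, max_depth=3):
    root = Node(())
    for _ in range(iterations):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: grow the tree one level (bounded depth).
        if len(node.path) < max_depth:
            node.children = [Node(p, node) for p in propose_steps(node.path)]
            node = random.choice(node.children)
        # Evaluation: self-reward replaces a learned value model.
        reward = self_reward(node.path)
        # Backpropagation: update statistics along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the most-visited reasoning path.
    best = max(root.children, key=lambda n: n.visits)
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
    return best.path

print(mcts())
```

Because the evaluator is the model's own self-reward rather than a separate critic, the search stays training-free: spending more iterations buys better paths, which is the test-time scaling behaviour the paper studies.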

Congzhi Zhang, Jiawei Peng, Zhenglin Wang, Yilong Lai, Haowen Sun, Heng Chang, Fei Ma, Weijiang Yu • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Mathematical Reasoning | MathVista (testmini) | Accuracy | 65.4 | 33
Multimodal Mathematical Reasoning | MathVista mini (test) | Overall Accuracy | 67.4 | 33
Visual Reasoning | CharXiv (val) | Text in Chart Accuracy | 37.95 | 16
Mathematical Reasoning | MATH-Vision mini (test) | ALG | 42.11 | 8
Multimodal Mathematical Reasoning | MATH-Vision (testmini) | Alg Score | 21.05 | 8

Other info

Code
