VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
About
Recent advancements in Large Vision-Language Models (LVLMs) have showcased remarkable capabilities. However, LVLMs often falter on complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning. To overcome these limitations, and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates the visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning, and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance on geometry and spatial reasoning tasks.
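The look-ahead tree search described above can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the names (`Node`, `propose_steps`, `rollout_value`, `tree_search`) are invented, and the string-based state and length-based scoring stand in for LVLM-generated visual-textual reasoning steps and an LVLM-based rollout evaluator.

```python
# Sketch of look-ahead tree search for test-time scaling.
# All identifiers here are hypothetical; in VisuoThink, states would be
# interleaved visual-textual reasoning traces and scoring would come
# from look-ahead rollouts evaluated by the model itself.
from dataclasses import dataclass, field


@dataclass
class Node:
    state: str                       # accumulated reasoning trace
    children: list = field(default_factory=list)


def propose_steps(state, k=3):
    """Hypothetical expansion: generate k candidate next reasoning steps."""
    return [f"{state} -> step{i}" for i in range(k)]


def rollout_value(state, depth=2):
    """Hypothetical look-ahead: simulate a short rollout and score the
    resulting trace (a placeholder heuristic stands in for a verifier)."""
    for _ in range(depth):
        state = propose_steps(state, k=1)[0]
    return -len(state)  # placeholder scoring


def tree_search(root_state, depth=3, beam=2):
    """Expand a search tree, keeping the `beam` best children per level
    as ranked by look-ahead rollout value."""
    frontier = [Node(root_state)]
    for _ in range(depth):
        candidates = []
        for node in frontier:
            for step in propose_steps(node.state):
                child = Node(step)
                node.children.append(child)
                candidates.append(child)
        candidates.sort(key=lambda n: rollout_value(n.state), reverse=True)
        frontier = candidates[:beam]
    return frontier[0].state  # best trace found


print(tree_search("problem"))
```

The key design point is that candidate steps are ranked not by their immediate appearance but by the value of a short simulated rollout, which is what lets additional inference-time compute (deeper look-ahead, wider beams) translate into better final answers.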
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Geometric Reasoning | Geomverse-109 | Accuracy@1 | 28.9 | 15 |
| Geometric Reasoning | Geometry3K | Accuracy@1 | 43.8 | 15 |
| Visual Navigation | Visual Navigation (level-3) | Pass@1 | 93.8 | 15 |
| Visual Navigation | Visual Navigation (level-4) | Pass@1 | 61.3 | 15 |
| Visual Tiling | Visual Tiling (level-2) | Pass@1 | 8.40e+3 | 15 |
| Visual Navigation | Visual Navigation (level-5) | Pass@1 | 5.32e+3 | 10 |