Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

About

As textual reasoning with large language models (LLMs) has advanced significantly, there has been growing interest in enhancing the multimodal reasoning capabilities of large vision-language models (LVLMs). However, existing methods primarily approach multimodal reasoning in a straightforward, text-centric manner, where both reasoning and answer derivation are conducted purely through text, with the only difference being the presence of multimodal input. As a result, these methods often encounter fundamental limitations in spatial reasoning tasks that demand precise geometric understanding and continuous spatial tracking, capabilities that humans achieve through mental visualization and manipulation. To address these limitations, we propose drawing to reason in space, a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. By equipping models with basic drawing operations, including annotating bounding boxes and drawing auxiliary lines, we empower them to express and analyze spatial relationships through direct visual manipulation, while avoiding the performance ceiling imposed by the specialized perception tools used in previous tool-integrated reasoning approaches. To cultivate this capability, we develop a three-stage training framework: cold-start training with synthetic data to establish basic drawing abilities, reflective rejection sampling to enhance self-reflection behaviors, and reinforcement learning to directly optimize for target rewards. Extensive experiments demonstrate that our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks, spanning maze navigation, static spatial reasoning, video-based reasoning, and multi-view-based reasoning tasks, with an average improvement of 18.4%.
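The drawing operations named in the abstract are simple enough to sketch concretely. Below is a minimal, illustrative executor for the two elementary operations described (annotating bounding boxes and drawing auxiliary lines); the function names, signatures, and PIL-based rendering are assumptions for illustration, not VILASR's actual interface.

```python
from PIL import Image, ImageDraw

# Hypothetical executor for the two elementary drawing operations the
# abstract describes. In an interleaved reasoning loop, the model would
# emit such operations between text steps, and the edited image would be
# fed back to it as new visual context.

def draw_bbox(image: Image.Image, box: tuple[float, float, float, float],
              label: str = "", color: str = "red") -> Image.Image:
    """Annotate a bounding box (x1, y1, x2, y2) on a copy of the image."""
    out = image.copy()
    canvas = ImageDraw.Draw(out)
    canvas.rectangle(box, outline=color, width=3)
    if label:
        canvas.text((box[0], box[1] - 12), label, fill=color)
    return out

def draw_line(image: Image.Image, start: tuple[float, float],
              end: tuple[float, float], color: str = "blue") -> Image.Image:
    """Draw an auxiliary line between two points, e.g. to compare distances."""
    out = image.copy()
    canvas = ImageDraw.Draw(out)
    canvas.line([start, end], fill=color, width=3)
    return out

# One hypothetical round of interleaved thinking and drawing:
# img = Image.open("scene.png")
# img = draw_bbox(img, (40, 60, 180, 220), label="chair")
# img = draw_line(img, (110, 140), (320, 150))  # auxiliary distance line
# ...pass img back to the model for the next reasoning step.
```

Because the action space is just generic image edits rather than calls to specialized perception tools, the model's reasoning is not capped by any single tool's accuracy, which is the ceiling the abstract refers to.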

Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Spatial Reasoning | VSI-Bench | Avg Score | 46.35 | 192 |
| Multimodal Reasoning | WeMath | Accuracy | 25.3 | 129 |
| Multimodal Reasoning | MathVision | Accuracy | 25 | 102 |
| Multimodal Reasoning | LogicVista | Accuracy | 32.2 | 99 |
| Spatial Reasoning | ViewSpatial | Accuracy | 35.7 | 92 |
| Multimodal Understanding | POPE | POPE Score | 0.848 | 90 |
| Multimodal Reasoning | MathVerse | Accuracy | 29.4 | 84 |
| Spatial Reasoning | VSI-Bench 1.0 (test) | Relative Distance Error | 45.1 | 80 |
| Multimodal Reasoning | MMBench | Overall Score | 80.8 | 78 |
| Visual Reasoning | BLINK | Accuracy | 56.2 | 76 |
Showing 10 of 75 rows.
