Vision-Language Memory for Spatial Reasoning

About

Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representations and understanding across frames. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory that performs spatial reasoning over a view-consistent, 3D-aware representation derived purely from 2D videos. Specifically, we incorporate a dual-memory module consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical information across frames. This design keeps spatial reasoning bounded and efficient at a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-based models, significantly advancing the frontier of visual-spatial intelligence.
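The dual-memory idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the class name `DualMemory`, the per-frame importance `score`, and the keep-the-top-scoring-frames consolidation rule are all assumptions made for illustration, and the actual model presumably operates on learned features rather than hand-supplied scores. The sketch only shows how a sliding window plus a capacity-limited store keeps the total memory size fixed regardless of video length.

```python
from collections import deque
import heapq


class DualMemory:
    """Toy dual memory (hypothetical): sliding-window working memory
    plus a capacity-bounded episodic memory."""

    def __init__(self, window_size, episodic_capacity):
        if window_size < 1 or episodic_capacity < 1:
            raise ValueError("both sizes must be at least 1")
        # Working memory: only the most recent `window_size` frames.
        self.working = deque(maxlen=window_size)
        # Episodic memory: min-heap of (score, frame_index, feature),
        # so the least important stored entry sits at index 0.
        self.episodic = []
        self.episodic_capacity = episodic_capacity

    def observe(self, frame_index, feature, score):
        """Add a frame; the frame about to leave the window is consolidated."""
        if len(self.working) == self.working.maxlen:
            self._consolidate(self.working[0])
        self.working.append((score, frame_index, feature))

    def _consolidate(self, entry):
        # Assumed rule: keep the highest-scoring frames seen so far.
        if len(self.episodic) < self.episodic_capacity:
            heapq.heappush(self.episodic, entry)
        elif entry[0] > self.episodic[0][0]:
            heapq.heapreplace(self.episodic, entry)

    def context(self):
        """Return (episodic frame indices in temporal order, working frame indices)."""
        episodic = sorted(self.episodic, key=lambda e: e[1])
        return [e[1] for e in episodic], [e[1] for e in self.working]


# Usage: six frames with made-up importance scores.
mem = DualMemory(window_size=2, episodic_capacity=2)
for i, s in enumerate([0.1, 0.9, 0.5, 0.7, 0.2, 0.3]):
    mem.observe(i, feature=None, score=s)
print(mem.context())  # ([1, 3], [4, 5])
```

However many frames are observed, the two stores together never hold more than `window_size + episodic_capacity` entries, which is the sense in which the cost stays bounded in this sketch.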

Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang • 2025

Related benchmarks

Task	Dataset	Result
Spatial Reasoning	VSI-Bench	R.Dr. 87.8	370
Spatial Reasoning	VSTI-Bench	Cam. Mov. Dir. Error 76.8	30
Spatial VQA	SQA3D (test)	Overall Accuracy 46.5	22
