
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

About

Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, prior studies attempt to construct spatial understanding via grid-based cognitive maps. However, current grid-based map methods rely on discretized representations, which limit the model's capacity for fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework uses continuous object boundary coordinates to enable quantitative spatial computation, which effectively reduces the ambiguity of natural language descriptions of spatial relationships. Specifically, our method comprises two stages. First, in the supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage enhances the model's real-world generalization capabilities. Based on this framework, we investigate factors that affect cognitive map accuracy and quantify its relationship with task performance. Evaluated on mainstream spatial reasoning benchmarks, our model, V2LO-7B, achieves an average improvement of 3.24% over the model trained on grid maps, validating the superiority of our method.
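
To illustrate why continuous boundary coordinates can support finer spatial computation than a grid map, the following minimal sketch compares the two representations on a toy layout. It is not the paper's implementation: the `Box` type, the use of metres, the axis-aligned footprints, the 1 m grid cell, and the example objects are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned object footprint in metres (hypothetical layout format)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def boundary_gap(a: Box, b: Box) -> float:
    """Shortest distance between the boundaries of two boxes (0 if they overlap)."""
    dx = max(a.x_min - b.x_max, b.x_min - a.x_max, 0.0)
    dy = max(a.y_min - b.y_max, b.y_min - a.y_max, 0.0)
    return math.hypot(dx, dy)


def grid_cell(box: Box, cell_size: float):
    """Grid cell that contains the box center (the discretized view)."""
    cx, cy = box.center()
    return (int(cx // cell_size), int(cy // cell_size))


def grid_distance(a: Box, b: Box, cell_size: float) -> float:
    """Distance between the grid cells of two boxes, scaled back to metres."""
    (ax, ay), (bx, by) = grid_cell(a, cell_size), grid_cell(b, cell_size)
    return math.hypot(ax - bx, ay - by) * cell_size


# Toy scene (assumed values): which object is closer to the sofa?
sofa = Box(0.0, 0.0, 2.0, 1.0)
table = Box(2.5, 0.2, 3.3, 0.8)
lamp = Box(2.1, 0.3, 2.3, 0.5)

# On a 1 m grid, the table and the lamp fall into the same cell,
# so the grid map cannot tell which one is closer to the sofa.
grid_tie = grid_distance(sofa, table, 1.0) == grid_distance(sofa, lamp, 1.0)

# With continuous boundaries, the gaps differ and the lamp is clearly closer.
lamp_is_closer = boundary_gap(sofa, lamp) < boundary_gap(sofa, table)
```

In this toy scene the grid representation yields a tie, while the boundary-based computation separates the two objects, which is the kind of relative-distance question the abstract describes as ambiguous under discretization.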

Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao • 2025

Related benchmarks

Task                               Dataset                                 Metric             Result  Rank
Spatial Reasoning                  ViewSpatial-Bench                       Overall Score      40.18   28
Spatial Reasoning                  SPAR-Bench                              Overall Score      36.68   23
Spatial Reasoning                  EmbodiedSpatial-Bench                   Accuracy           68.63   14
Spatial Reasoning                  QVS-Bench                               Relative Distance  68      10
Spatial Reasoning                  Spatial Reasoning Benchmarks Aggregate  Overall Score      47.46   8
Omnidirectional Spatial Reasoning  OmniSpatial-Bench                       Overall Score      44.36   8