MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

About

Spatial reasoning in 3D space is central to human cognition and indispensable for embodied tasks such as navigation and manipulation. However, state-of-the-art vision-language models (VLMs) frequently struggle with tasks as simple as anticipating how a scene will look after an egocentric motion: they perceive 2D images but lack an internal model of 3D dynamics. We therefore propose MindJourney, a test-time scaling framework that grants a VLM this missing capability by coupling it to a controllable world model based on video diffusion. The VLM iteratively sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. The VLM then reasons over the multi-view evidence gathered during this interactive exploration. Without any fine-tuning, MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT, showing that pairing VLMs with world models for test-time scaling offers a simple, plug-and-play route to robust 3D reasoning. Our method also improves upon test-time inference VLMs trained through reinforcement learning, which demonstrates the potential of using world models for test-time scaling.
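The explore-then-reason loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function and type names (`explore_and_answer`, `propose_action`, `imagine_view`, `answer`), the string-valued actions and views, and the `"stop"` action are all hypothetical placeholders, and the single greedy trajectory is a simplification, since the abstract does not specify how trajectories are searched or selected.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: a real system would use camera poses and images.
Action = str  # e.g. "forward", "turn_left", "stop"
View = str    # placeholder for an image


@dataclass
class Evidence:
    """Views and actions collected during the imagined exploration."""
    views: List[View] = field(default_factory=list)
    trajectory: List[Action] = field(default_factory=list)


def explore_and_answer(
    start_view: View,
    question: str,
    propose_action: Callable[[str, Sequence[View]], Action],  # VLM role (assumed)
    imagine_view: Callable[[View, Action], View],             # world-model role (assumed)
    answer: Callable[[str, Sequence[View]], str],             # VLM role (assumed)
    max_steps: int = 3,
) -> Tuple[str, Evidence]:
    """Alternate between choosing a camera move and synthesizing the new view,
    then answer the question from all views gathered so far."""
    evidence = Evidence(views=[start_view])
    current = start_view
    for _ in range(max_steps):
        action = propose_action(question, evidence.views)
        if action == "stop":  # assumed convention for ending exploration early
            break
        current = imagine_view(current, action)
        evidence.views.append(current)
        evidence.trajectory.append(action)
    return answer(question, evidence.views), evidence


# Toy stubs so the sketch runs without any model.
def toy_propose(question: str, views: Sequence[View]) -> Action:
    return "turn_left" if len(views) < 3 else "stop"


def toy_imagine(view: View, action: Action) -> View:
    return f"{view}->{action}"


def toy_answer(question: str, views: Sequence[View]) -> str:
    return f"answered using {len(views)} views"


result, evidence = explore_and_answer(
    "img0", "What is left of the chair?",
    toy_propose, toy_imagine, toy_answer, max_steps=5,
)
```

In this sketch the VLM is never fine-tuned; the extra views only enlarge the evidence passed to `answer`, which mirrors the test-time scaling idea in the abstract under the assumptions stated above.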

Yuncong Yang, Jiageng Liu, Zheyuan Zhang, Siyuan Zhou, Reuben Tan, Jianwei Yang, Yilun Du, Chuang Gan • 2025

Related benchmarks

Task	Dataset	Result
Spatial Mental Modeling	SAT (real)	AVG 84.7	41
Spatial Reasoning	SAT (real)	Accuracy (Pass@1) 31.33	25
Spatial Reasoning	MMSI-Bench MindJourney Subset (162 questions) (test)	Accuracy 0.3395	19
Spatial Mental Modeling	SAT (synthesized)	EgoM 87.1	15
Visual Question Answering	EmbSpatial-Bench	Accuracy 74.7	13
Visual Question Answering	SAT (real)	Accuracy 78.7	13
Visual Question Answering	BLINK Relative-Depth	Accuracy 83.1	12
Visual Question Answering	Spatial Reasoning Average	Accuracy 79.1	12
Visual Question Answering	BLINK Spatial-Relation	Accuracy 81.8	12
Visual Spatial Inference	VSI-Bench Tiny video-input	Object Count Score 47	12

Showing 10 of 20 rows
