Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

About

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.
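The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: a duality consistency reward added to the task reward, and an operation pool that retires mastered operations. It assumes a horizontal-flip operation whose answer mapping swaps "left" and "right". All names here (`FLIP_ANSWER_MAP`, `consistency_reward`, `total_reward`, `OperationPool`) and the weight and threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field

# Hypothetical answer mapping for a horizontal-flip operation: left/right swap.
FLIP_ANSWER_MAP = {"left": "right", "right": "left"}


def map_answer(answer: str, answer_map: dict[str, str]) -> str:
    """Return the answer expected after the transformation (unchanged if unmapped)."""
    return answer_map.get(answer, answer)


def consistency_reward(orig_answer: str, transformed_answer: str,
                       answer_map: dict[str, str]) -> float:
    """1.0 if the answer on the transformed input equals the mapped original answer."""
    return 1.0 if transformed_answer == map_answer(orig_answer, answer_map) else 0.0


def total_reward(correct: bool, consistent: float, weight: float = 0.5) -> float:
    """Task reward plus a weighted auxiliary consistency term (weight is assumed)."""
    return float(correct) + weight * consistent


@dataclass
class OperationPool:
    """Tracks per-operation inconsistency and drops operations that look mastered."""
    retire_below: float = 0.05  # assumed failure-rate threshold for retirement
    min_trials: int = 20        # assumed number of trials before retirement is allowed
    stats: dict[str, list[int]] = field(default_factory=dict)  # name -> [failures, trials]

    def record(self, name: str, consistent: bool) -> None:
        entry = self.stats.setdefault(name, [0, 0])
        entry[0] += 0 if consistent else 1
        entry[1] += 1

    def failure_rate(self, name: str) -> float:
        failures, trials = self.stats[name]
        return failures / trials if trials else 1.0

    def active(self) -> list[str]:
        """Operations still worth training on, most inconsistent first."""
        keep = [name for name, (failures, trials) in self.stats.items()
                if trials < self.min_trials or failures / trials >= self.retire_below]
        return sorted(keep, key=self.failure_rate, reverse=True)
```

In a GRPO-style loop, a value like `total_reward(...)` would be computed per sampled response before group-relative normalization, and `OperationPool.active()` would decide which transformations to sample next. How SAGE actually weights the terms or schedules operations is not stated in the abstract.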

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 67.5	635
Spatial Reasoning	Viewspatial	Accuracy 44.1	129
Video Understanding	MMVU	Accuracy 65.4	91
Spatial Reasoning	MindCube	Accuracy 44.5	91
Spatial Reasoning	CV-Bench	Accuracy 77.2	89
Video Understanding	VideoMMMU	Accuracy 51.7	67
Video Understanding	VideoMME	Accuracy 65.1	33
Video Understanding	VSI-Bench	Accuracy 37.7	23
Spatial Reasoning	MMSI	Score 32.3	21
Video Understanding	TempCmps	Accuracy 78.4	11

Showing 10 of 13 rows
