CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

About

Because latent visual reasoning has the potential for exploratory reasoning, recent work enables MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing methods typically rely on hard alignment objectives that force latent representations to match predefined visual features, which severely limits the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning), a latent contrastive training framework for more exploratory visual reasoning. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process and thus foster diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.
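To make the idea of a contrastive objective guided by angle-based perturbation more concrete, the following is a minimal illustrative sketch, not the CoLVR implementation. It assumes a generic InfoNCE-style loss over cosine similarities and builds a positive by rotating a latent vector by a fixed angle. The function names `rotate_by_angle` and `info_nce`, the temperature value, and the choice of InfoNCE itself are all assumptions made for illustration; the abstract does not specify the actual objective.

```python
import numpy as np


def rotate_by_angle(z, theta, rng):
    """Return a unit vector at angle `theta` (radians) from `z`.

    Hypothetical helper: one simple way to realise an angle-based
    perturbation. A random direction orthogonal to `z` is sampled and
    mixed with `z` so the resulting cosine similarity is cos(theta).
    """
    z_unit = z / np.linalg.norm(z)
    noise = rng.standard_normal(z.shape)
    # Remove the component along z to obtain an orthogonal direction.
    ortho = noise - np.dot(noise, z_unit) * z_unit
    ortho /= np.linalg.norm(ortho)
    return np.cos(theta) * z_unit + np.sin(theta) * ortho


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss with cosine similarity (assumed, not from the paper)."""

    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = np.array(sims) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_positive


rng = np.random.default_rng(0)
latent = rng.standard_normal(16)                    # stand-in for a latent state
positive = rotate_by_angle(latent, 0.2, rng)        # small-angle perturbation
negatives = [rng.standard_normal(16) for _ in range(4)]
loss = info_nce(latent, positive, negatives)
```

In this sketch the angle controls how far the positive may drift from the anchor: a larger angle lowers the anchor-positive similarity and raises the loss, so the angle acts as a tolerance on how tightly the latent is constrained.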

Ziyang Ding, Linjian Meng, Yiming Wu, Yuhan Li, Yuhao Liu, Zhen Zhao • 2026

Related benchmarks

Task	Dataset	Result
Vision Understanding	MMVP	Accuracy: 72	45
Visual Reasoning	Jigsaw	Accuracy: 78	44
Visual Understanding	MMStar	--	16
Visual Understanding	CV-Bench	Accuracy: 76.95	15
Visual Understanding	VisPuzzle	Accuracy: 37	14
Visual Reasoning	Tertis	Accuracy: 44.67	9
Visual Spatial Perception	VSP Unseen	Accuracy (Level 7): 43	9
Visual Spatial Perception	VSP Total	Accuracy (Total): 65.83	9
Visual Spatial Perception	VSP Seen	Accuracy (Level 3): 94	9
Visual Understanding	V*	Accuracy: 80.63	3

