Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

About

Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly smaller gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
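The adaptive allocation of extra reasoning steps to complex tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the per-token complexity scores, the fixed threshold, the function names (route_depth, refine, halve), and the toy refinement step are all hypothetical, chosen only to show the idea of giving some tokens more refinement passes than others.

```python
from typing import Callable, List

Vector = List[float]


def route_depth(scores: List[float], base_steps: int = 1,
                extra_steps: int = 2, threshold: float = 0.5) -> List[int]:
    """Assign a step count per token (hypothetical routing rule).

    Tokens whose complexity score exceeds the threshold receive
    extra_steps additional refinement passes on top of base_steps.
    """
    return [base_steps + (extra_steps if s > threshold else 0) for s in scores]


def refine(latents: List[Vector], steps: List[int],
           step_fn: Callable[[Vector], Vector]) -> List[Vector]:
    """Apply step_fn to each latent vector the number of times routed to it."""
    refined = []
    for vec, n in zip(latents, steps):
        for _ in range(n):
            vec = step_fn(vec)
        refined.append(vec)
    return refined


def halve(vec: Vector) -> Vector:
    """Toy stand-in for one refinement pass (assumption, not a real layer)."""
    return [0.5 * x for x in vec]


# Three tokens; only the second is scored as complex.
scores = [0.1, 0.9, 0.4]
steps = route_depth(scores)
latents = [[8.0], [8.0], [8.0]]
refined = refine(latents, steps, halve)
print(steps, refined)
```

In this toy setup the high-scoring token passes through three refinement steps while the others pass through one. In a real model the score would presumably come from a learned router, and the step would be a transformer block rather than a scalar multiplication.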

Yudong Han, Yong Wang, Zaiquan Yang, Zhen Qu, Liyuan Pan, Xiangxiang Chu • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MathVista	Accuracy 69.8	382
Mathematical Reasoning	MathVision	Accuracy 25.89	168
Visual Perception	MMVP	Accuracy 76.67	118
Multimodal Reasoning	M3CoT (test)	Total Acc 73	55
Mathematical Reasoning	MMATH	Accuracy 39.82	36
Visual Perception	HallusionBench	Accuracy 68.63	24
Compositional Reasoning	MMStar	Accuracy 60.82	16
Visual Perception	SeedBench-2-Plus	Accuracy 66.86	15
Compositional Reasoning	BLINK	Accuracy 57.96	12
Visual Perception	HR-Bench	Accuracy 74.21	11

Showing 10 of 11 rows
