Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
About
Multimodal latent reasoning has emerged as a promising paradigm that replaces explicit Chain-of-Thought (CoT) decoding with implicit feature propagation, simultaneously enhancing representation informativeness and reducing inference latency. By analyzing token-level gradient dynamics during latent training, we reveal two critical observations: (1) visual tokens exhibit significantly higher and more volatile gradient norms than their textual counterparts due to inherent language bias, resulting in systematic visual under-optimization; and (2) semantically simple tokens converge rapidly, whereas complex tokens exhibit persistent gradient instability constrained by fixed architectural depths. To address these limitations, we propose a visual replay module and routing depth scaling to collaboratively enhance visual perception and refine complicated latents for deeper contextual reasoning. The former module leverages causal self-attention to estimate token saliency, reinforcing fine-grained grounding through spatially-coherent constraints. Complementarily, the latter mechanism adaptively allocates additional reasoning steps to complex tokens, enabling deeper contextual refinement. Guided by a curriculum strategy that progressively internalizes explicit CoT into compact latent representations, our framework achieves state-of-the-art performance across diverse benchmarks while delivering substantial inference speedups over explicit CoT baselines.
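The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`visual_saliency`, `top_k_tokens`, `extra_depth`), the averaging rule, the threshold, and the step cap are all hypothetical choices made for illustration. The sketch assumes saliency is read off a causal attention matrix as the mean attention a visual token receives from later tokens, and that a per-token difficulty score in [0, 1] is mapped to a small number of extra refinement steps.

```python
import math
from typing import List, Sequence


def visual_saliency(attn: Sequence[Sequence[float]],
                    visual_idx: Sequence[int]) -> List[float]:
    """Mean attention each visual token (as a key) receives from later queries.

    attn[q][k] is the weight from query q to key k. Under a causal mask only
    k <= q is nonzero, so only queries q > k are averaged (assumed rule).
    """
    n = len(attn)
    scores = []
    for k in visual_idx:
        later = [attn[q][k] for q in range(k + 1, n)]
        scores.append(sum(later) / len(later) if later else 0.0)
    return scores


def top_k_tokens(scores: Sequence[float],
                 visual_idx: Sequence[int], k: int) -> List[int]:
    """Indices of the k most salient visual tokens, returned in sequence order."""
    ranked = sorted(zip(scores, visual_idx), key=lambda p: p[0], reverse=True)
    return sorted(idx for _, idx in ranked[:k])


def extra_depth(router_scores: Sequence[float],
                threshold: float = 0.5, max_extra: int = 2) -> List[int]:
    """Map per-token difficulty scores in [0, 1] to extra refinement steps.

    Tokens at or below the (assumed) threshold get no extra steps; harder
    tokens get between 1 and max_extra steps, growing with the score.
    """
    steps = []
    for s in router_scores:
        if s <= threshold:
            steps.append(0)
        else:
            frac = (s - threshold) / (1.0 - threshold)
            steps.append(min(max_extra, math.ceil(frac * max_extra)))
    return steps


# Toy causal attention over 4 tokens; tokens 0 and 1 are treated as visual.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.6, 0.2, 0.0],
    [0.1, 0.7, 0.1, 0.1],
]
saliency = visual_saliency(attn, [0, 1])
replay = top_k_tokens(saliency, [0, 1], k=1)
depth = extra_depth([0.1, 0.5, 0.75, 1.0])
```

In this toy case token 1 receives more attention from later tokens than token 0, so it is the one selected for replay, and only the two highest-scoring tokens are given extra refinement steps. A real system would presumably learn the router and operate on model hidden states rather than fixed scores.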
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MathVista | Accuracy 69.8 | 257 |
| Mathematical Reasoning | MathVision | Accuracy 25.89 | 144 |
| Visual Perception | MMVP | Accuracy 76.67 | 82 |
| Multimodal Reasoning | M3CoT (test) | Total Acc 73 | 47 |
| Mathematical Reasoning | MMATH | Accuracy 39.82 | 24 |
| Compositional Reasoning | MMStar | Accuracy 60.82 | 16 |
| Visual Perception | HallusionBench | Accuracy 68.63 | 15 |
| Visual Perception | SeedBench-2-Plus | Accuracy 66.86 | 15 |
| Compositional Reasoning | BLINK | Accuracy 57.96 | 12 |
| Visual Perception | HR-Bench | Accuracy 74.21 | 11 |