
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

About

Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
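The core idea of aligning the student's visual attention with the teacher's, then annealing that supervision via a gate, can be sketched in a few lines. The sketch below is illustrative only, assuming simplified attention maps as arrays; the function names (`attention_alignment_loss`, `gated_loss`) and the KL-based formulation are hypothetical stand-ins, not the paper's actual autoregressive reconstruction objective or released code.

```python
import numpy as np

def attention_alignment_loss(teacher_attn, student_attn, eps=1e-8):
    """Mean KL divergence between teacher and student attention
    distributions over visual tokens. A hypothetical simplification
    of aligning 'latent visual thoughts' rather than static embeddings."""
    # Normalize each attention row into a probability distribution.
    t = teacher_attn / (teacher_attn.sum(axis=-1, keepdims=True) + eps)
    s = student_attn / (student_attn.sum(axis=-1, keepdims=True) + eps)
    # KL(t || s), averaged over rows; zero when the maps coincide.
    return float(np.mean(np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1)))

def gated_loss(align_loss, text_loss, gate):
    """Curriculum sensory gate (illustrative): `gate` anneals from 1
    (visual alignment dominates, blocking language-prior shortcuts)
    toward 0 (standard text loss) over training."""
    return gate * align_loss + (1.0 - gate) * text_loss
```

In this toy form, a student whose attention matches the teacher's incurs near-zero alignment loss, while one that attends to divergent regions (the "Perception Gap") is penalized before its text loss dominates.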

Linquan Wu, Tianxiang Jiang, Yifei Dong, Haoyu Yang, Fengji Zhang, Shichaang Meng, Ai Xuan, Linqi Song, Jacky Keung • 2026

Related benchmarks

Task                        | Dataset      | Result             | Rank
Vision-Intensive Perception | V* Benchmark | Attr Score: 82.61  | 18
Fine-Grained Perception     | MMVP (test)  | MMVP Score: 67.33  | 11
Multimodal Robustness       | MMStar (test)| MMStar Score: 54.07| 11
Visual Reasoning            | BLINK (test) | Rel Depth: 78.23   | 10
