Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

About

In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We trace this to a root cause, Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further find that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends the acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.

Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha • 2026
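
The single-shot idea in the abstract can be illustrated with a toy example. The sketch below is not the SCOLAR detransformer; it is a minimal NumPy illustration that assumes one plausible mechanism, a set of learned queries cross-attending over the full sequence of hidden states. The names `one_shot_latents`, `queries`, `w_k`, and `w_v` are hypothetical, as are all shapes. What it shows is that every latent token is computed in parallel from the same hidden states, so changing one token's query leaves the other tokens untouched, unlike autoregressive decoding where each step depends on earlier outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def one_shot_latents(hidden, queries, w_k, w_v):
    """Produce all K latent tokens in one pass (hypothetical mechanism).

    hidden:  (T, d) full-sequence hidden states
    queries: (K, d) one query per latent token (assumed, not from the paper)
    w_k, w_v: (d, d) key and value projections
    """
    keys = hidden @ w_k                                   # (T, d)
    values = hidden @ w_v                                 # (T, d)
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])   # (K, T)
    # Each row attends over the hidden states only, never over other latents.
    return softmax(scores, axis=-1) @ values              # (K, d)


rng = np.random.default_rng(0)
T, d, K = 16, 8, 4  # toy sizes chosen for illustration
hidden = rng.normal(size=(T, d))
queries = rng.normal(size=(K, d))
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

latents = one_shot_latents(hidden, queries, w_k, w_v)

# Perturb only the first query and recompute.
perturbed_queries = queries.copy()
perturbed_queries[0] += 1.0
latents_perturbed = one_shot_latents(hidden, perturbed_queries, w_k, w_v)

# Only latent 0 is affected; the remaining latents are unchanged.
changed = [not np.allclose(a, b) for a, b in zip(latents, latents_perturbed)]
print(latents.shape, changed)
```

In an autoregressive decoder the same perturbation to the first latent would propagate into every later one, which is the dependence the abstract links to Information Gain Collapse.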

Related benchmarks

Task	Dataset	Metric	Result	
High-resolution perception	HR-Bench-4K	Overall Score	75.5	126
Visual Perception and Reasoning	V*Bench	Attribute Score	86.09	49
Multimodal Perception and Reasoning	MME-RealWorld-Lite	Overall Score	59.87	21
Visual Reasoning	VisualPuzzles OOD (test)	Overall Accuracy	34.42	8
Fine-grained High-Resolution Perception	HRBench8K	Overall Score	67.63	7
