Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
About
In language reasoning, longer chains of thought consistently yield better performance, which suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we observe the opposite: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We trace this to a root cause we call Information Gain Collapse: autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further find that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that uses the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends the acceptable latent CoT length by over $30\times$, achieves state-of-the-art results among open-source models on real-world reasoning benchmarks (+14.12% over the backbone), and demonstrates strong out-of-distribution generalization.
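The contrast between step-by-step and single-shot latent generation can be illustrated with a small NumPy sketch. This is not the SCOLAR detransformer: it assumes, purely for illustration, that single-shot generation resembles one cross-attention layer in which a set of query vectors attends over the LLM's full-sequence hidden states. The names `single_shot_latents`, `queries`, `w_k`, and `w_v` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def single_shot_latents(hidden, queries, w_k, w_v):
    """Produce all K latent tokens in one pass (illustrative assumption).

    hidden:  (T, d) full-sequence hidden states from the language model
    queries: (K, d) one query vector per latent token (hypothetical)
    w_k/w_v: (d, d) key and value projections (hypothetical)
    Returns an array of shape (K, d).
    """
    keys = hidden @ w_k
    values = hidden @ w_v
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])  # (K, T)
    attn = softmax(scores, axis=-1)  # each row is normalized on its own
    return attn @ values  # row i depends only on query i and `hidden`


rng = np.random.default_rng(0)
T, K, d = 16, 4, 8
hidden = rng.normal(size=(T, d))
queries = rng.normal(size=(K, d))
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))

latents = single_shot_latents(hidden, queries, w_k, w_v)

# Perturb only the first query: the other latent tokens are unaffected,
# because no latent token is conditioned on another latent token.
perturbed = queries.copy()
perturbed[0] += 1.0
latents_perturbed = single_shot_latents(hidden, perturbed, w_k, w_v)
```

In this sketch every output row is computed from the shared hidden states alone, so lengthening the latent sequence does not make later tokens depend on earlier ones. An autoregressive generator has no such property, since each step consumes the previous step's output.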
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| High-resolution perception | HR-Bench-4K | Overall Score | 75.5 | 103 |
| Visual Perception and Reasoning | V*Bench | Attribute Score | 86.09 | 49 |
| Visual Reasoning | VisualPuzzles OOD (test) | Overall Accuracy | 34.42 | 8 |
| Multimodal Perception and Reasoning | MME-RealWorld-Lite | Overall Score | 59.87 | 7 |
| Fine-grained High-Resolution Perception | HRBench8K | Overall Score | 67.63 | 7 |