DeepLatent: Think with Images via Parallel Latent Visual Reasoning
About
The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.
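The abstract does not give LatentFormer's architecture, so the following is only a minimal sketch of the general idea it describes: a fixed set of learnable query tokens, conditioned on context, attends to the original image features in a single pass, so every latent state is produced independently rather than autoregressively. All names here (`parallel_latents`, `queries`, `context`, `image_feats`), the additive conditioning, and the single-head dot-product attention are assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def parallel_latents(queries, context, image_feats):
    """Return one latent state per query token.

    Each latent is an attention-weighted (convex) mix of the ORIGINAL image
    features, so no latent depends on a previously generated latent.
    """
    d = len(context)
    scale = 1.0 / math.sqrt(d)
    latents = []
    for q in queries:  # iterations are independent and could run in parallel
        # Context conditioning: assumed here to be a simple addition.
        cq = [qi + ci for qi, ci in zip(q, context)]
        weights = softmax([dot(cq, f) * scale for f in image_feats])
        latent = [sum(w * f[j] for w, f in zip(weights, image_feats))
                  for j in range(d)]
        latents.append(latent)
    return latents

# Toy inputs standing in for learned tokens and encoder outputs (hypothetical).
rng = random.Random(0)
d, num_queries, num_patches = 4, 3, 6
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_queries)]
context = [rng.gauss(0, 1) for _ in range(d)]
image_feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_patches)]

latents = parallel_latents(queries, context, image_feats)
```

Because each query reads only the shared image features and context, dropping or reordering queries leaves the remaining latents unchanged, which is the property that separates this parallel scheme from autoregressive latent generation.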
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| OCR Evaluation | OCRBench | Score | 86.4 | 350 |
| High-resolution perception | V* | Overall Score | 85.3 | 55 |
| Visual Reasoning | Jigsaw | Accuracy | 68 | 40 |
| General Task | MMStar | Accuracy | 65 | 36 |
| High-resolution perception | HR-Bench-8K | Score | 74.1 | 32 |
| Visual Reasoning | VSP | Accuracy | 83.7 | 17 |
| Visual Reasoning | VisuLogic | Avg Score | 25.3 | 16 |
| High-resolution perception | HR4K | Overall Score | 77.9 | 13 |
| Visual Reasoning | BabyVision | Accuracy | 16.2 | 12 |
| High-resolution perception | MME-RW | Score | 64.2 | 10 |