
DeepLatent: Think with Images via Parallel Latent Visual Reasoning

About

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.
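The abstract does not specify LatentFormer's architecture, so the following is only a minimal sketch of the general idea it describes: a small set of learnable tokens, conditioned on context, each reading directly from the original image features, with no token depending on another's output. It assumes single-head dot-product cross-attention with no learned projections, and context injected by simple addition. The function name `parallel_latent_states`, the `context` vector, and all sizes are hypothetical illustrations, not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_latent_states(queries, image_feats, context):
    """Produce one latent state per query token (hypothetical sketch).

    queries:     K vectors of size d (stand-ins for learnable 2D tokens)
    image_feats: N vectors of size d (original image features)
    context:     one vector of size d (assumed summary of the prompt)
    returns:     K latent vectors of size d
    """
    d = len(image_feats[0])
    scale = 1.0 / math.sqrt(d)
    latents = []
    # Each iteration depends only on its own query, so the loop body
    # could run for all tokens at once (no autoregressive dependency).
    for q in queries:
        # Assumption: condition the token on context by addition.
        cq = [qi + ci for qi, ci in zip(q, context)]
        scores = [scale * sum(a * b for a, b in zip(cq, f)) for f in image_feats]
        weights = softmax(scores)
        # The latent is a weighted average of the original image features.
        latents.append(
            [sum(w * f[j] for w, f in zip(weights, image_feats)) for j in range(d)]
        )
    return latents


random.seed(0)
d, n_patches, grid = 8, 16, 2  # hypothetical sizes: a 2x2 grid of latent tokens
image_feats = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_patches)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(grid * grid)]
context = [random.gauss(0, 1) for _ in range(d)]
latents = parallel_latent_states(queries, image_feats, context)
```

Because every latent here is a convex combination of the image features, each update stays tied to the original visual input rather than to previously generated latents. This contrasts with an autoregressive scheme, where each latent would be computed from the ones before it.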

Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| OCR Evaluation | OCRBench | Score | 86.4 | 350 |
| High-resolution perception | V* | Overall Score | 85.3 | 55 |
| Visual Reasoning | Jigsaw | Accuracy | 68 | 40 |
| General Task | MMStar | Accuracy | 65 | 36 |
| High-resolution perception | HR-Bench-8K | Score | 74.1 | 32 |
| Visual Reasoning | VSP | Accuracy | 83.7 | 17 |
| Visual Reasoning | VisuLogic | Avg Score | 25.3 | 16 |
| High-resolution perception | HR4K | Overall Score | 77.9 | 13 |
| Visual Reasoning | BabyVision | Accuracy | 16.2 | 12 |
| High-resolution perception | MME-RW | Score | 64.2 | 10 |
Showing 10 of 16 rows
