UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

About

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.
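
As a rough illustration of the data flow described above (render the reasoning trace together with an auxiliary image into one workspace, then compress that workspace into a fixed number of latent tokens), here is a toy sketch in plain Python. It is not the authors' implementation: the function names, the character-to-pixel "renderer", and the mean-pooling "compressor" are all hypothetical stand-ins for components the abstract does not specify, and in the actual method the compression is learned.

```python
from typing import List

Grid = List[List[float]]  # a grayscale "image" as rows of pixel values in [0, 1]


def render_text(trace: str, width: int) -> Grid:
    """Toy stand-in for text rendering: one row per `width` characters,
    each pixel is the character's code point scaled into [0, 1]."""
    rows: Grid = []
    for start in range(0, len(trace), width):
        chunk = trace[start:start + width].ljust(width)
        rows.append([min(ord(ch), 255) / 255.0 for ch in chunk])
    return rows


def build_workspace(trace: str, aux_image: Grid) -> Grid:
    """Stack the rendered reasoning trace on top of the auxiliary image,
    giving a single shared visual workspace."""
    width = len(aux_image[0])
    return render_text(trace, width) + aux_image


def compress(workspace: Grid, num_latents: int) -> List[float]:
    """Hypothetical compressor: mean-pool the flattened workspace into
    `num_latents` contiguous buckets (a learned encoder would go here)."""
    flat = [pixel for row in workspace for pixel in row]
    if not 0 < num_latents <= len(flat):
        raise ValueError("num_latents must be between 1 and the pixel count")
    latents = []
    for i in range(num_latents):
        lo = i * len(flat) // num_latents
        hi = (i + 1) * len(flat) // num_latents
        bucket = flat[lo:hi]
        latents.append(sum(bucket) / len(bucket))
    return latents


# Assumed toy inputs: a blank 4x8 auxiliary image and a short reasoning trace.
aux_image = [[0.0] * 8 for _ in range(4)]
trace = "The sign is left of the door."
workspace = build_workspace(trace, aux_image)
latents = compress(workspace, num_latents=4)
```

The point of the sketch is only the shape of the pipeline: however long the trace is, the result is a fixed-size list of latents, which is what would let inference skip generating the text reasoning token by token.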

Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Visual Question Answering	TextVQA	TextVQA Accuracy 80.4	210
Visual Perception and Reasoning	V*	Overall Accuracy 82.7	60
Visual Reasoning	HR-Bench-4K	Overall Score 0.733	54
Visual Reasoning	HR-Bench-8K	Overall Score 68.8	42
General Perception and Reasoning	MME-RealWorld-Lite	Overall Accuracy 50.7	21
Mathematical Reasoning	WeMath Loose	Accuracy 49.4	6
Multimodal Evaluation	MME translation	MME Translation Score 200	6
Efficiency Evaluation	HRBench8K	Average Time per Sample (s) 4.5	5
Efficiency Evaluation	MME-RealWorld-Lite	Average Time per Sample (s) 3.2	5
