Semantic-Enriched Latent Visual Reasoning

About

Multimodal latent-space reasoning aims to replace explicit "thinking with images" by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question-answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

Tianrun Xu, Yue Sun, Qixun Wang, Jingyi Lu, Yuan Wang, Tianren Zhang, Longteng Guo, Fengyun Rao, Jing Lyu, Feng Chen, Jing Liu • 2026
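
The abstract names M-GRPO but does not give its objective. As background only, the sketch below shows the group-relative advantage used in standard GRPO (each sampled response's reward minus the group mean, divided by the group standard deviation), followed by one hypothetical way such advantages could be pooled when several queries are grounded in the same region. The pooling rule, the function names, the use of the population standard deviation, and the reward values are all assumptions made for illustration, not the paper's formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of sampled responses.

    rewards: list of scalar rewards, one per sampled response to a single query.
    Returns one advantage per response: (r - group mean) / (group std + eps).
    Population std is used here; that choice is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pooled_latent_advantages(rewards_per_query, eps=1e-8):
    """HYPOTHETICAL multi-query pooling (not taken from the paper).

    rewards_per_query[q][k] is the reward obtained when the k-th sampled
    latent for a region is used to answer query q about that region.
    Each query's rewards are normalized separately, then averaged over
    queries so that a latent is credited for helping across all of them.
    """
    group_sizes = {len(r) for r in rewards_per_query}
    if len(group_sizes) != 1:
        raise ValueError("every query needs the same number of sampled latents")
    per_query = [group_relative_advantages(r, eps) for r in rewards_per_query]
    # zip(*per_query) regroups the advantages by latent index k.
    return [mean(column) for column in zip(*per_query)]


if __name__ == "__main__":
    # Two queries about one region, three sampled latents (made-up rewards).
    rewards = [
        [1.0, 0.0, 1.0],  # query 1
        [1.0, 0.0, 0.0],  # query 2
    ]
    print(pooled_latent_advantages(rewards))
```

Under this assumed pooling, a latent that answers only one of the queries well receives a smaller advantage than one that answers all of them, which is one plausible reading of "aligning latent representations across multiple queries"; the paper itself should be consulted for the actual objective.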

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VizWiz	Accuracy 57.8	1863
Visual Question Answering	ChartQA	Accuracy 77.2	620
Visual Question Answering	AI2D	Accuracy 76	402
Visual Question Answering	GQA	Accuracy 55.6	218
Visual Question Answering	TextVQA	TextVQA Accuracy 79.3	210
Visual Question Answering	OKVQA	Accuracy 61.8	34
Visual Reasoning	VisualPuzzles	Overall Score 34.2	12
Visual Question Answering	SV-QA (V*)	Q1 Accuracy 82.2	6
Visual Question Answering	SV-QA HRBench-4K	Q1 Accuracy 70.4	6
Visual Question Answering	SV-QA HRBench-8K	Accuracy (Q1) 62.5	6
