Semantic-Enriched Latent Visual Reasoning
About
Multimodal latent-space reasoning aims to replace explicit "thinking with images" by performing visual reasoning directly in a compact latent space. However, existing approaches rely largely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.
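The abstract does not spell out the M-GRPO objective, so the following is only a minimal sketch of the group-relative advantage normalization used in standard GRPO, plus one possible way to fold in multiple queries. The function names and the choice to average rewards over the queries that share a region are assumptions made for illustration, not the authors' formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of rollouts:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def multi_query_advantages(rewards_by_rollout, eps=1e-8):
    """Hypothetical multi-query variant (an assumption, not the paper's method).

    rewards_by_rollout[i][j] is the reward of rollout i on query j, where all
    queries are grounded in the same region. Each rollout's rewards are
    averaged over its queries, then normalized across the group.
    """
    per_rollout = [mean(query_rewards) for query_rewards in rewards_by_rollout]
    return group_relative_advantages(per_rollout, eps=eps)


if __name__ == "__main__":
    # Three rollouts of the same region latent, each scored on two queries.
    rewards = [[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
    print(multi_query_advantages(rewards))
```

Under this assumed aggregation, a rollout earns a positive advantage only if its latent serves the whole set of queries about the region better than the group average, which is one way to read "aligning latent representations across multiple queries".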
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 57.8 | 1820 |
| Visual Question Answering | ChartQA | Accuracy | 77.2 | 519 |
| Visual Question Answering | AI2D | Accuracy | 76 | 317 |
| Visual Question Answering | TextVQA | TextVQA Accuracy | 79.3 | 210 |
| Visual Question Answering | GQA | Accuracy | 55.6 | 155 |
| Visual Question Answering | OKVQA | Accuracy | 61.8 | 26 |
| Visual Question Answering | SV-QA (V*) | Q1 Accuracy | 82.2 | 6 |
| Visual Question Answering | SV-QA HRBench-4K | Q1 Accuracy | 70.4 | 6 |
| Visual Question Answering | SV-QA HRBench-8K | Accuracy (Q1) | 62.5 | 6 |
| Visual Reasoning | VisualPuzzles | Algorithmic Score | 37.4 | 3 |