LaRe: Latent Refocusing for Multimodal Reasoning
About
Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise: that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Multimodal Understanding | MMBench | Accuracy 87.4 | 847 |
| Science Question Answering | ScienceQA | -- | 791 |
| Visual Perception | MMVP | Accuracy 74 | 118 |
| Multimodal Reasoning | MMStar | Accuracy 77.1 | 78 |
| Multimodal Reasoning | ScienceQA | Average Accuracy 87.5 | 45 |
| Multimodal Reasoning | MMStar | Accuracy 72.8 | 25 |
| Expert-level Multimodal Understanding | MMMU-Pro | Accuracy 67.1 | 20 |
| Multimodal Reasoning | MMBench | Accuracy 82.1 | 16 |
| Multimodal Reasoning | MMVP | Accuracy 57.6 | 16 |