LaRe: Latent Refocusing for Multimodal Reasoning

About

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, but incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise: that effective visual refocusing must occur in the form of explicit tokens. Challenging this premise, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Multimodal Understanding | MMBench | Accuracy 87.4 | 847 |
| Science Question Answering | ScienceQA | -- | 791 |
| Visual Perception | MMVP | Accuracy 74 | 118 |
| Multimodal Reasoning | MMStar | Accuracy 77.1 | 78 |
| Multimodal Reasoning | ScienceQA | Average Accuracy 87.5 | 45 |
| Multimodal Reasoning | MMStar | Accuracy 72.8 | 25 |
| Expert-level Multimodal Understanding | MMMU-Pro | Accuracy 67.1 | 20 |
| Multimodal Reasoning | MMBench | Accuracy 82.1 | 16 |
| Multimodal Reasoning | MMVP | Accuracy 57.6 | 16 |
Showing 10 of 14 rows
