LaRe: Latent Refocusing for Multimodal Reasoning

About

Chain of Thought (CoT) reasoning enhances logical performance by decomposing complex tasks, yet its multimodal extension faces a trade-off. The prevailing Thinking with Images paradigm achieves visual refocusing by explicitly cropping image regions, yet incurs rapidly growing computational overhead. The emerging line of latent-space reasoning reduces token consumption, but lacks the capacity for dynamic refocusing. We argue that this trade-off stems from a tacitly accepted premise that effective visual refocusing must occur in the form of explicit tokens. Building on this, we propose Latent Refocusing (LaRe), a new multimodal reasoning paradigm in which visual refocusing takes place entirely within the latent space. We further design a semantic augmentation training strategy that ensures the semantic structure of the latent space through a visual reconstruction objective. Experimental evaluations demonstrate that LaRe improves average accuracy by 7.6% compared to existing baselines while reducing the number of tokens required for inference by 59.7%. When scaled to an 8B-parameter Vision-Language Model backbone, LaRe achieves performance comparable to state-of-the-art methods, demonstrating the efficacy of our proposed latent refocusing paradigm for multimodal reasoning.

Jizheng Ma, Xiaofei Zhou, Geyuan Zhang, Yanlong Song, Han Yan • 2025
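
To make the headline figures in the abstract concrete, the following is a minimal arithmetic sketch, not code from the paper. The baseline token count and baseline accuracy are hypothetical values chosen only for illustration, and the function names are assumptions. The abstract does not say whether the 7.6% accuracy gain is absolute (percentage points) or relative, so the sketch computes both readings.

```python
def tokens_after_reduction(baseline_tokens: float, reduction_pct: float) -> float:
    """Tokens remaining after a relative reduction of `reduction_pct` percent."""
    return baseline_tokens * (1.0 - reduction_pct / 100.0)


def accuracy_after_gain(baseline_acc: float, gain_pct: float, relative: bool) -> float:
    """Accuracy (in percent) after a gain, read as relative or as percentage points."""
    if relative:
        return baseline_acc * (1.0 + gain_pct / 100.0)
    return baseline_acc + gain_pct


# Hypothetical baseline values (assumptions, not taken from the paper).
baseline_tokens = 1000.0
baseline_acc = 60.0

remaining = tokens_after_reduction(baseline_tokens, 59.7)
acc_absolute = accuracy_after_gain(baseline_acc, 7.6, relative=False)
acc_relative = accuracy_after_gain(baseline_acc, 7.6, relative=True)

print(f"tokens remaining: {remaining:.1f}")
print(f"accuracy if gain is in percentage points: {acc_absolute:.2f}")
print(f"accuracy if gain is relative: {acc_relative:.2f}")
```

The two accuracy readings diverge more as the baseline moves away from 100, which is why the distinction matters when comparing against the benchmark results listed below.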

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Science Question Answering	ScienceQA	Accuracy 72.6	916
Multimodal Understanding	MMBench	Accuracy 87.4	887
Visual Question Answering	VQA v2	Accuracy 75	257
Visual Question Answering	VQA v2 (test)	--	142
Visual Perception	MMVP	Accuracy 74	118
Multimodal Reasoning	MMStar	Accuracy 77.1	102
Multimodal Reasoning	ScienceQA	Average Accuracy 87.5	45
Multimodal Reasoning	MMMU-Pro	Accuracy 58.4	33
Multimodal Reasoning	MMStar	Accuracy 72.8	25

Showing 10 of 18 rows
