
Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

About

While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
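The paper's abstract does not specify the reward's exact form, but a minimal sketch can make the idea concrete. The code below illustrates one way a region-level visual attention reward could work: measure how much of the model's attention mass falls inside an annotated image region, then blend that score with the task-correctness reward. Everything here is a hypothetical assumption for illustration, including the `attention_reward` and `combined_reward` functions, the bounding-box annotation format, and the mixing weight `lam`; none of it is SAYO's actual formulation.

```python
# Hypothetical sketch of a region-level visual attention reward.
# Assumptions (not from the paper): attention arrives as a per-patch
# map over the image, and each sample annotates a ground-truth region
# as a bounding box in patch coordinates.

import numpy as np

def attention_reward(attn_map: np.ndarray, box: tuple[int, int, int, int]) -> float:
    """Fraction of total attention mass inside the annotated region.

    attn_map: (H, W) non-negative attention weights over image patches.
    box: (top, left, bottom, right) in patch coordinates, end-exclusive.
    """
    total = attn_map.sum()
    if total <= 0:
        return 0.0
    top, left, bottom, right = box
    inside = attn_map[top:bottom, left:right].sum()
    return float(inside / total)

def combined_reward(task_correct: bool, attn_map: np.ndarray,
                    box: tuple[int, int, int, int], lam: float = 0.5) -> float:
    # lam is an assumed mixing weight; the paper's weighting may differ.
    r_task = 1.0 if task_correct else 0.0
    return (1.0 - lam) * r_task + lam * attention_reward(attn_map, box)

# Toy example: a 4x4 attention map with all mass inside the annotated box.
attn = np.zeros((4, 4))
attn[1:3, 1:3] = 0.25                               # mass concentrated in the box
print(combined_reward(True, attn, (1, 1, 3, 3)))    # -> 1.0
```

In an actual RL training loop, a scalar like this would be computed per rollout and added to the policy-gradient objective alongside the answer reward; the precise region annotations and credit-assignment scheme would follow the paper.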

Siqu Ou, Tianrui Wan, Zhiyuan Zhao, Junyu Gao, Xuelong Li • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Diagram Understanding | AI2D | Accuracy | 83.06 | 167 |
| Chart Understanding | ChartQA | Accuracy | 82.28 | 83 |
| Mathematical Reasoning | WeMath | Accuracy | 64.83 | 75 |
| Visual Reasoning | V*Bench | Accuracy | 83.25 | 58 |
| Mathematical Reasoning | MATH-Vision | Accuracy | 25.26 | 32 |
| General Visual Reasoning | MMStar | Accuracy | 65.27 | 29 |
| General Visual Reasoning | MME-RealWorld-Lite | Accuracy | 62.85 | 17 |
| General Visual Reasoning | M3CoT | Accuracy | 68.46 | 17 |
| Scientific Figure Reasoning | CharXiv | Accuracy | 43.2 | 17 |
