Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

About

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings (stored as key-value caches) into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
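The select-then-attend idea described above can be illustrated with a toy, single-head sketch. This is not the authors' implementation. Several choices here are assumptions made only for illustration: regions are consecutive chunks of a 1D token list (the abstract describes spatial grouping), each region descriptor is the mean of its keys, relevance is the dot product between the query and the descriptor, and the context token is a fixed vector rather than a learned one. The names `build_regions`, `gaze_attention`, `region_size`, and `top_k` are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_regions(keys, num_visual, region_size):
    """Group the first `num_visual` cached keys into consecutive regions.

    Returns a list of (token_indices, descriptor) pairs. The descriptor is
    the mean key of the region, which is an assumption for this sketch.
    """
    regions = []
    for start in range(0, num_visual, region_size):
        idx = list(range(start, min(start + region_size, num_visual)))
        descriptor = [sum(col) / len(idx) for col in zip(*(keys[i] for i in idx))]
        regions.append((idx, descriptor))
    return regions


def gaze_attention(query, keys, values, regions, context_idx, top_k):
    """Attend only to the top_k regions plus the always-visible context tokens."""
    # Score every region by its descriptor (one dot product per region).
    scores = [dot(query, desc) for _, desc in regions]
    chosen = sorted(range(len(regions)), key=lambda r: scores[r], reverse=True)[:top_k]

    # Context tokens are always attended so some global information remains.
    attended = list(context_idx)
    for r in sorted(chosen):
        attended.extend(regions[r][0])

    # Standard scaled dot-product attention, restricted to the attended entries.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in attended])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, attended)) for d in range(dim)]
    return out, attended


# Toy cache: 8 visual tokens (4 regions of 2) followed by 1 context token.
keys = [[0, 1], [0, 1], [1, 0], [1, 0], [-1, 0], [-1, 0],
        [0.5, 0.5], [0.5, 0.5], [0.1, 0.1]]
values = [[float(i), 1.0] for i in range(len(keys))]
regions = build_regions(keys, num_visual=8, region_size=2)

out, attended = gaze_attention([1.0, 0.0], keys, values, regions,
                               context_idx=[8], top_k=1)
print(attended)  # [8, 2, 3]: the context token plus the best-matching region
```

In this toy setting the query touches 3 of the 9 cached entries. The cost of scoring regions grows with the number of regions rather than the number of tokens, which is where the savings would come from when regions are large.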

Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	GQA	Accuracy 58.2	524
Optical Character Recognition	OCRBench	--	486
Visual Question Answering	RealworldQA	Accuracy 61.7	327
Text-based Visual Question Answering	TextVQA	TextVQA Accuracy 66.2	141
Visual Question Answering	MMVP	Accuracy 47.3	82
Science Question Answering	ScienceQA image	Score 82.1	70
Video Question Answering	VideoMME (test)	--	61
Mathematical Reasoning	MathVista	MathVista 31.1	55
Multimodal Knowledge and Math	MMMU (val)	Accuracy 47.4	33
Chart Question Answering	ChartQA	ChartQA Score 64.9	28

Showing 10 of 13 rows
