Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

About

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to diluted focus and unnecessary computational overhead. In this work, we introduce Gaze Attention, a novel mechanism that enables MLLMs to selectively attend to task-relevant visual regions during generation. Specifically, we spatially group visual embeddings (stored as key-value caches) into compact gaze regions, each represented by a lightweight descriptor. At each decoding step, the model dynamically selects the most relevant regions and restricts attention to them, reducing redundant computation while enhancing focus. To mitigate the loss of global context caused by localized attention, we further propose learnable context tokens appended to each image or frame, allowing the model to maintain holistic visual awareness. Extensive experiments on image and video understanding benchmarks demonstrate that Gaze Attention matches or surpasses dense-attention baselines, while using up to 90% fewer visual KV entries in the attention computation.
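The selection step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that regions are contiguous, equal-sized chunks of the visual token sequence (a simplification of spatial grouping), that each region descriptor is the mean of the region's keys, that regions are ranked by a dot product with the current query, and that context entries are simply prepended and always attended. The function name `gaze_attention_step` and all of its parameters are hypothetical.

```python
import numpy as np


def gaze_attention_step(q, K, V, region_size, top_k, K_ctx=None, V_ctx=None):
    """One decoding step of region-restricted attention (illustrative sketch).

    q: (d,) query for the current decoding step.
    K, V: (n, d) cached visual keys and values; n must be divisible by region_size.
    K_ctx, V_ctx: optional (c, d) context entries that are always attended.
    Returns the attention output (d,) and the sorted ids of the selected regions.
    """
    n, d = K.shape
    assert n % region_size == 0, "n must be a multiple of region_size"
    num_regions = n // region_size

    # Assumption: a region descriptor is the mean of the keys in that region.
    descriptors = K.reshape(num_regions, region_size, d).mean(axis=1)

    # Rank regions by similarity to the query and keep the top_k best.
    region_scores = descriptors @ q
    selected = np.sort(np.argsort(region_scores)[-top_k:])

    # Expand region ids into the token indices they cover.
    token_idx = (selected[:, None] * region_size + np.arange(region_size)).ravel()
    K_sel, V_sel = K[token_idx], V[token_idx]

    # Assumption: context entries are prepended so they are never dropped.
    if K_ctx is not None and V_ctx is not None:
        K_sel = np.concatenate([K_ctx, K_sel], axis=0)
        V_sel = np.concatenate([V_ctx, V_sel], axis=0)

    # Standard scaled dot-product attention over the reduced KV set.
    logits = K_sel @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V_sel, selected
```

In this sketch the attention cost per step scales with `top_k * region_size` plus the number of context entries, rather than with the full visual cache length `n`. Setting `top_k` to the total number of regions recovers ordinary dense attention.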

Junha Song, Byeongho Heo, Geonmo Gu, Jaegul Choo, Dongyoon Han, Sangdoo Yun • 2026

Related benchmarks

Task                                  Dataset          Result                 Rank
Visual Question Answering             GQA              Accuracy 58.2          524
Optical Character Recognition         OCRBench         --                     433
Visual Question Answering             RealworldQA      Accuracy 61.7          259
Visual Question Answering             MMVP             Accuracy 47.3          82
Science Question Answering            ScienceQA image  Score 82.1             70
Video Question Answering              VideoMME (test)  --                     61
Mathematical Reasoning                MathVista        MathVista 31.1         55
Text-based Visual Question Answering  TextVQA          TextVQA Accuracy 66.2  33
Multimodal Knowledge and Math         MMMU (val)       Accuracy 47.4          33
Image Question Answering              MME Perception   MME-P Score 1.59e+3    23
Showing 10 of 13 rows
