
Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

About

Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention that is either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose the Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM, and DRAM baselines on standard image classification benchmarks.
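To make the idea of two decoupled recurrent layers concrete, here is a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the wiring (a task layer that reads each glimpse, and a location layer that reads the task state and proposes the next glimpse center), the layer sizes, the random untrained weights, and all function names are assumptions made for illustration. Real hard attention models typically sample glimpse locations stochastically and are trained with reinforcement learning; this sketch is deterministic and untrained.

```python
import math
import random

def crop_glimpse(image, center, size):
    """Hard attention: cut a size x size patch around center (row, col), clamped to the image."""
    h, w = len(image), len(image[0])
    r0 = min(max(center[0] - size // 2, 0), h - size)
    c0 = min(max(center[1] - size // 2, 0), w - size)
    return [image[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]

def make_weights(rows, cols, rng):
    """Small random weight matrix (untrained, illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(weights, vector):
    """Matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def rnn_step(weights, inputs, state):
    """Plain tanh recurrent update over the concatenation [inputs, state]."""
    return [math.tanh(v) for v in linear(weights, inputs + state)]

def run_episode(image, steps=4, glimpse=4, hidden=8, classes=10, seed=0):
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    # Assumed wiring: task layer reads the glimpse, location layer reads the task state.
    w_task = make_weights(hidden, glimpse * glimpse + hidden, rng)
    w_loc = make_weights(hidden, hidden + hidden, rng)
    w_xy = make_weights(2, hidden, rng)
    w_cls = make_weights(classes, hidden, rng)
    task_state = [0.0] * hidden
    loc_state = [0.0] * hidden
    center = (h // 2, w // 2)  # start at the image center
    scanpath = []
    for _ in range(steps):
        scanpath.append(center)
        patch = crop_glimpse(image, center, glimpse)
        task_state = rnn_step(w_task, patch, task_state)    # task execution layer
        loc_state = rnn_step(w_loc, task_state, loc_state)  # location generation layer
        # Map the location state to normalized coordinates in [-1, 1], then to pixels.
        ny, nx = (math.tanh(v) for v in linear(w_xy, loc_state))
        center = (round((ny + 1) / 2 * (h - 1)), round((nx + 1) / 2 * (w - 1)))
    logits = linear(w_cls, task_state)  # class scores come from the task layer only
    return scanpath, logits

image = [[(r * 8 + c) / 64.0 for c in range(8)] for r in range(8)]
scanpath, logits = run_episode(image)
```

The returned `scanpath` is the sequence of glimpse centers, which is the kind of trajectory one would inspect for fixational versus saccadic behavior.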

Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-10 | Accuracy 64.18 | 508 |
| Image Classification | ImageNet100 (test) | Top-1 Acc 12.88 | 87 |
| Classification | COCO | Accuracy 35.86 | 31 |
| Scanpath Alignment | CIFAR-10 | DTW 930.4 | 18 |
| Visual Search | COCO-Search18 cross-task | Accuracy (%) 14.17 | 7 |
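The scanpath alignment row reports a DTW (dynamic time warping) score, where a lower value means two fixation sequences are closer. The listing does not say which DTW variant, point distance, or normalization was used, so the following is only a sketch of the classic formulation, assuming scanpaths are lists of (x, y) fixation coordinates and Euclidean point distance. The function name `dtw_distance` is illustrative.

```python
import math

def dtw_distance(path_a, path_b):
    """Classic dynamic time warping cost between two sequences of 2D points."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost aligning path_a[:i] with path_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # path_a advances, path_b repeats a point
                cost[i][j - 1],      # path_b advances, path_a repeats a point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# A scanpath that lingers on one fixation still aligns perfectly with a shorter one.
model_path = [(0, 0), (0, 0), (3, 4)]
human_path = [(0, 0), (3, 4)]
score = dtw_distance(model_path, human_path)
```

Because DTW lets one sequence repeat a point while the other advances, it tolerates differences in fixation duration and sequence length, which is why it is a common choice for comparing scanpaths.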
