Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision

About

Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention patterns that are either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose the Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM, and DRAM baselines on standard image classification benchmarks.
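
The decoupling described in the abstract can be illustrated with a small, untrained forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not say which layer feeds which, so the wiring here (a lower "task" layer that reads glimpses and an upper "location" layer that reads the task state) is borrowed from the general DRAM-style layout. All names, sizes, and the Gaussian location noise are hypothetical, and the weights are random, so no training (for example, with REINFORCE) is shown.

```python
import math
import random

def make_matrix(rows, cols, rng, scale=0.1):
    # Small random weights; a stand-in for learned parameters (assumption).
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def affine_tanh(W, x):
    # One dense layer with tanh activation, written without external libraries.
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def extract_glimpse(image, loc, size):
    # Crop a size x size patch centred at loc = (row, col) in [-1, 1], zero-padded.
    h, w = len(image), len(image[0])
    cy = int(round((loc[0] + 1) / 2 * (h - 1)))
    cx = int(round((loc[1] + 1) / 2 * (w - 1)))
    half = size // 2
    patch = []
    for dy in range(-half, -half + size):
        for dx in range(-half, -half + size):
            y, x = cy + dy, cx + dx
            patch.append(image[y][x] if 0 <= y < h and 0 <= x < w else 0.0)
    return patch

def run_two_level_attention(image, steps=4, glimpse=4, hidden=8, classes=10, seed=0):
    rng = random.Random(seed)
    g_dim = glimpse * glimpse
    # Hypothetical parameters for the two recurrent layers and their read-outs.
    W_task = make_matrix(hidden, g_dim + 2 + hidden, rng)  # glimpse + location + own state
    W_loc = make_matrix(hidden, hidden + hidden, rng)      # task state + own state
    W_emit = make_matrix(2, hidden, rng)                   # location layer -> next location mean
    W_cls = make_matrix(classes, hidden, rng)              # task layer -> class logits

    h_task = [0.0] * hidden
    h_loc = [0.0] * hidden
    loc = [0.0, 0.0]                 # start at the image centre
    scanpath = [tuple(loc)]
    for _ in range(steps):
        patch = extract_glimpse(image, loc, glimpse)
        # Task layer: integrates what was seen and where.
        h_task = affine_tanh(W_task, patch + loc + h_task)
        # Location layer: separate recurrent state that decides where to look next.
        h_loc = affine_tanh(W_loc, h_task + h_loc)
        mean = affine_tanh(W_emit, h_loc)
        # Hard attention: sample a location around the mean, clipped to the image.
        loc = [max(-1.0, min(1.0, rng.gauss(m, 0.1))) for m in mean]
        scanpath.append(tuple(loc))
    logits = [sum(w * v for w, v in zip(row, h_task)) for row in W_cls]
    return logits, scanpath

image = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
logits, scanpath = run_two_level_attention(image)
```

The point of the sketch is only that the class logits are read from one recurrent state while the glimpse locations are emitted from another, so the two functions do not share a single hidden vector. The returned `scanpath` is the sequence of sampled glimpse centres.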

Pengcheng Pan, Yonekura Shogo, Yasuo Kuniyoshi • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	CIFAR-10	Accuracy 64.18	973
Image Classification	ImageNet100 (test)	Top-1 Acc 12.88	87
Classification	COCO	Accuracy 35.86	31
Scanpath Alignment	CIFAR-10	DTW 930.4	18
Visual Search	COCO-Search18 cross-task	Accuracy (%) 14.17	7
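
The Scanpath Alignment row reports a DTW (dynamic time warping) score. As a rough illustration of how such a score is typically computed between two fixation sequences, here is the textbook DTW recurrence with Euclidean point distance. This is a sketch under assumptions: the listing does not state the coordinate units, the step pattern, or any normalization used for the reported value, and `dtw_distance` is a hypothetical helper name.

```python
import math

def dtw_distance(path_a, path_b):
    # Classic dynamic time warping between two sequences of (x, y) fixations.
    # cost[i][j] is the minimal accumulated distance aligning the first i
    # points of path_a with the first j points of path_b.
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of path_b
                                 cost[i][j - 1],      # repeat a point of path_a
                                 cost[i - 1][j - 1])  # advance both paths
    return cost[n][m]

# Two scanpaths that visit the same places at different speeds align perfectly.
same_route = dtw_distance([(0, 0), (0, 0), (1, 0)], [(0, 0), (1, 0)])
```

Under this definition a lower score means the two scanpaths are more similar, and sequences of different lengths can still be compared because one fixation may be matched to several.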
