
GazeMoE: Perception of Gaze Target with Mixture-of-Experts

About

Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
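To make the MoE decoding idea concrete, here is a minimal sketch of gated expert routing over features from a frozen backbone: a gating network scores a small pool of experts (one might imagine one per cue, e.g. eyes, head pose, gesture, context), keeps the top-k, and returns the gate-weighted sum of their outputs. All names, shapes, and the top-k scheme are illustrative assumptions, not the released GazeMoE code.

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# expert/gate shapes and the top-k choice are assumptions, not the paper's).
import math
import random

random.seed(0)

DIM, N_EXPERTS, TOP_K = 8, 4, 2  # hypothetical feature size and expert count

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# One linear "expert" per cue; a linear gate scores all experts.
experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]
gate = rand_matrix(N_EXPERTS, DIM)

def moe_forward(feature):
    """Route `feature` to the top-k experts chosen by the gate."""
    logits = matvec(gate, feature)
    # Keep only the top-k gate logits and renormalise them with softmax.
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * DIM
    for w, i in zip(weights, top):
        for d, y in enumerate(matvec(experts[i], feature)):
            out[d] += w * y
    return out, top

feature = [random.uniform(-1, 1) for _ in range(DIM)]
output, chosen = moe_forward(feature)
print(len(output), len(chosen))
```

Because only the selected experts contribute to the output, this kind of routing lets a decoder adaptively emphasise different gaze-related cues per input while keeping the backbone frozen.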

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M. Alcaraz Calero, Chen Li • 2026
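The class-balancing auxiliary loss mentioned in the abstract targets the imbalance between in-frame and out-of-frame gaze labels. One common way to realise this, shown here purely as a hedged sketch (the exact weighting in GazeMoE is not specified on this page), is binary cross-entropy with per-class weights inversely proportional to class frequency.

```python
# Illustrative class-balanced binary cross-entropy; the inverse-frequency
# weighting is an assumption, not necessarily the paper's exact loss.
import math

def class_balanced_weights(labels):
    """Weight each class by the inverse of its frequency in the batch."""
    n = len(labels)
    pos = sum(labels)          # in-frame count
    neg = n - pos              # out-of-frame count
    return n / (2.0 * pos), n / (2.0 * neg)

def weighted_bce(probs, labels, w_pos, w_neg):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp for numerical stability
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(labels)

# Imbalanced toy batch: four in-frame (1) vs one out-of-frame (0) sample.
labels = [1, 1, 1, 1, 0]
probs = [0.9, 0.8, 0.95, 0.7, 0.4]
w_pos, w_neg = class_balanced_weights(labels)
print(w_pos, w_neg)  # 0.625 2.5
loss = weighted_bce(probs, labels, w_pos, w_neg)
```

The rarer class (out-of-frame here) receives the larger weight, so mistakes on it contribute more to the loss, which is the usual intent of such balancing terms.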

Related benchmarks

Task                    Dataset               Metric              Result   Rank
Gaze target estimation  GazeFollow            AUC                 0.959    45
Gaze target estimation  VideoAttentionTarget  L2 Distance         0.097    39
Gaze target estimation  ChildPlay (test)      AUC                 94.5     11
Gaze target estimation  GazeFollow360         Spherical Distance  0.771    10
Gaze target estimation  EYEDIAP               AUC                 61.8     5
Gaze target estimation  GazeFollow            Latency (ms)        74.2     3
