
GazeMoE: Perception of Gaze Target with Mixture-of-Experts

About

Estimating human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues -- including eyes, head poses, gestures, and contextual features -- demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that our GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
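To make the MoE decoding idea concrete, here is a minimal sketch of gated expert routing over features from a frozen backbone: a gating network scores a small pool of experts (one might imagine one per cue, e.g. eyes, head pose, gesture, context), keeps the top-k, and returns the gate-weighted sum of their outputs. All names, shapes, and the top-k scheme are illustrative assumptions, not the released GazeMoE code.

```python
# Minimal sketch of top-k Mixture-of-Experts routing (illustrative only;
# expert/gate shapes and the top-k choice are assumptions, not the paper's).
import math
import random

random.seed(0)

DIM, N_EXPERTS, TOP_K = 8, 4, 2  # hypothetical feature size and expert count

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# One linear "expert" per cue; a linear gate scores all experts.
experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]
gate = rand_matrix(N_EXPERTS, DIM)

def moe_forward(feature):
    """Route `feature` to the top-k experts chosen by the gate."""
    logits = matvec(gate, feature)
    # Keep only the top-k gate logits and renormalise them with softmax.
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * DIM
    for w, i in zip(weights, top):
        for d, y in enumerate(matvec(experts[i], feature)):
            out[d] += w * y
    return out, top

feature = [random.uniform(-1, 1) for _ in range(DIM)]
output, chosen = moe_forward(feature)
print(len(output), len(chosen))
```

Because only the selected experts contribute to the output, this kind of routing lets a decoder adaptively emphasise different gaze-related cues per input while keeping the backbone frozen.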

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M. Alcaraz Calero, Chen Li • 2026
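The class-balancing auxiliary loss mentioned in the abstract targets the imbalance between in-frame and out-of-frame gaze labels. One common way to realise this, shown here purely as a hedged sketch (the exact weighting in GazeMoE is not specified on this page), is binary cross-entropy with per-class weights inversely proportional to class frequency.

```python
# Illustrative class-balanced binary cross-entropy; the inverse-frequency
# weighting is an assumption, not necessarily the paper's exact loss.
import math

def class_balanced_weights(labels):
    """Weight each class by the inverse of its frequency in the batch."""
    n = len(labels)
    pos = sum(labels)          # in-frame count
    neg = n - pos              # out-of-frame count
    return n / (2.0 * pos), n / (2.0 * neg)

def weighted_bce(probs, labels, w_pos, w_neg):
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp for numerical stability
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(labels)

# Imbalanced toy batch: four in-frame (1) vs one out-of-frame (0) sample.
labels = [1, 1, 1, 1, 0]
probs = [0.9, 0.8, 0.95, 0.7, 0.4]
w_pos, w_neg = class_balanced_weights(labels)
print(w_pos, w_neg)  # 0.625 2.5
loss = weighted_bce(probs, labels, w_pos, w_neg)
```

The rarer class (out-of-frame here) receives the larger weight, so mistakes on it contribute more to the loss, which is the usual intent of such balancing terms.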

Related benchmarks

Task                    Dataset               Metric              Result   Rank
Gaze target estimation  GazeFollow            AUC                 0.959    45
Gaze target estimation  VideoAttentionTarget  L2 Distance         0.097    39
Gaze target estimation  ChildPlay (test)      AUC                 94.5     11
Gaze target estimation  GazeFollow360         Spherical Distance  0.771    10
Gaze target estimation  EYEDIAP               AUC                 61.8     5
Gaze target estimation  GazeFollow            Latency (ms)        74.2     3
