Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

About

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle
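
The abstract describes a single scene feature map that is made person-specific by a positional prompt. The following is a minimal sketch of one plausible reading of that idea, not the authors' implementation: random arrays stand in for the frozen encoder's patch tokens, the prompt vector is random rather than learned, and the names `head_mask` and `add_head_prompt` are hypothetical. It assumes the person is identified by a head bounding box in normalized image coordinates.

```python
import numpy as np

def head_mask(bbox, grid_h, grid_w):
    """Binary mask over the patch grid marking patches that overlap a head box.

    bbox is (xmin, ymin, xmax, ymax) in normalized [0, 1] image coordinates
    (an assumption made for this sketch).
    """
    xmin, ymin, xmax, ymax = bbox
    # Convert normalized coordinates to patch indices; floor/ceil so that
    # partially covered patches are included.
    c0 = int(np.floor(xmin * grid_w))
    r0 = int(np.floor(ymin * grid_h))
    c1 = int(np.ceil(xmax * grid_w))
    r1 = int(np.ceil(ymax * grid_h))
    mask = np.zeros((grid_h, grid_w))
    mask[r0:r1, c0:c1] = 1.0
    return mask

def add_head_prompt(tokens, mask, prompt):
    """Add a prompt vector to every scene token that lies inside the head mask.

    tokens: (grid_h, grid_w, dim) scene features (stand-in for encoder output)
    mask:   (grid_h, grid_w) binary head mask
    prompt: (dim,) vector; learned in a real model, random here
    """
    return tokens + mask[..., None] * prompt

# Toy example: a 4x4 patch grid with 8-dimensional tokens.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 4, 4, 8
tokens = rng.normal(size=(grid_h, grid_w, dim))
prompt = rng.normal(size=dim)
mask = head_mask((0.25, 0.0, 0.5, 0.5), grid_h, grid_w)
prompted = add_head_prompt(tokens, mask, prompt)
```

Because the scene tokens are computed once and only the prompt depends on the person, the same encoder output could in principle be reused for every person in the image. A lightweight decoder would then consume `prompted` to produce a gaze prediction; that decoder is omitted here.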

Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg • 2024

Related benchmarks

Task	Dataset	Result
Gaze target estimation	GazeFollow	Avg L2 Distance 0.099	48
Gaze target estimation	VideoAttentionTarget	L2 Distance 0.103	39
Gaze Following	VideoAttentionTarget	L2 Distance 0.103	38
Gaze Following	GazeFollowing	Minimum Distance 0.041	35
Gaze Following	VideoAttentionTarget (test)	AUC 0.937	20
Gaze Following	ChildPlay	Distance 0.101	17
Gaze target estimation	ChildPlay (test)	AUC 95.1	11
Gaze target estimation	GazeFollow360	Spherical Distance 0.759	10
Gaze Following	VideoAttentionTarget (VAT) Consistent	AUC 0.9528	9
Gaze Following	VideoAttentionTarget (VAT) Inconsistent	AUC 0.9059	9

Showing 10 of 19 rows
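
Several rows above report distance-style metrics (Avg L2 Distance, Minimum Distance). As a rough illustration of how such numbers are commonly computed, and not a statement of any benchmark's official evaluation protocol, the sketch below measures the Euclidean distance between a predicted gaze point and one or more annotated gaze points. It assumes all points are in normalized [0, 1] image coordinates; the function names and example values are hypothetical.

```python
import math

def l2_distance(pred, target):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(pred[0] - target[0], pred[1] - target[1])

def avg_and_min_distance(pred, annotations):
    """Average and minimum distance from one prediction to several annotations.

    Assumes each annotation is an (x, y) point from a different annotator.
    """
    dists = [l2_distance(pred, a) for a in annotations]
    return sum(dists) / len(dists), min(dists)

# Illustrative values only; these are not taken from any dataset.
pred = (0.5, 0.5)
annotations = [(0.5, 0.6), (0.8, 0.9)]
avg_d, min_d = avg_and_min_distance(pred, annotations)
```

Lower is better for both values. Under these assumptions, a distance of 0.1 would correspond to an error of one tenth of the image size.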
