OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

About

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.
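To make the heatmap supervision idea concrete, here is a minimal sketch of how a gaze target heatmap might be built from a normalized gaze point and decoded back into coordinates. It is not the OmniGF implementation: the function names, the 64x64 resolution, the Gaussian target, and the argmax decoding are all assumptions chosen for illustration.

```python
import math

def make_heatmap(x, y, size=64, sigma=3.0):
    """Build a size x size Gaussian heatmap centred on a normalized point (x, y) in [0, 1].

    Hypothetical helper: the resolution and sigma are illustrative assumptions.
    """
    cx, cy = x * (size - 1), y * (size - 1)
    return [
        [math.exp(-((c - cx) ** 2 + (r - cy) ** 2) / (2 * sigma ** 2)) for c in range(size)]
        for r in range(size)
    ]

def decode_argmax(heatmap):
    """Return the normalized (x, y) location of the heatmap's peak cell."""
    size = len(heatmap)
    best_r, best_c = max(
        ((r, c) for r in range(size) for c in range(size)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    return best_c / (size - 1), best_r / (size - 1)

target = (0.30, 0.70)          # normalized gaze target (x, y)
hm = make_heatmap(*target)     # supervision target for a spatial branch
pred = decode_argmax(hm)       # point recovered from the heatmap
err = math.dist(pred, target)  # localization error in normalized units
```

In this sketch the recovered point can differ from the target by at most half a grid cell per axis, so the error shrinks as the heatmap resolution grows.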

Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras • 2026

Related benchmarks

Task	Dataset	Result
Gaze target estimation	GazeFollow	Avg L2 Distance: 0.092	67
Gaze Following	VideoAttentionTarget	L2 Distance: 0.096	38
Gaze Following	GazeFollowing	Minimum Distance: 0.04	35
Gaze Following	ChildPlay	Distance: 0.09	17
Social Gaze Prediction	VSGaze	F1 (LAH): 81.4	12
Semantic Gaze Recognition	GazeFollow	Accuracy@1: 64.9	3
Semantic Gaze Prediction	GazeHOI	GazeAcc: 78.6	2
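The distance metrics in the table are not defined on this page. The sketch below assumes one common convention from the gaze following literature: coordinates are normalized to [0, 1], the average distance is measured to the mean of the human annotations, and the minimum distance is measured to the closest single annotation. The function name and sample values are hypothetical, and the benchmark's own evaluation code may differ.

```python
import math

def gaze_distances(pred, annotations):
    """Return (avg_dist, min_dist) for one predicted gaze point.

    Assumed convention: avg_dist is the L2 distance to the mean annotation,
    min_dist is the L2 distance to the nearest annotation.
    """
    mean_x = sum(a[0] for a in annotations) / len(annotations)
    mean_y = sum(a[1] for a in annotations) / len(annotations)
    avg_dist = math.dist(pred, (mean_x, mean_y))
    min_dist = min(math.dist(pred, a) for a in annotations)
    return avg_dist, min_dist

# Illustrative values only, not taken from any dataset.
pred = (0.5, 0.5)
annotations = [(0.5, 0.6), (0.6, 0.5), (0.4, 0.5)]
avg_dist, min_dist = gaze_distances(pred, annotations)
```

Under this convention a lower value is better for both metrics, and a dataset-level score would be the mean over all test instances.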

