
A Modular Multimodal Architecture for Gaze Target Prediction: Application to Privacy-Sensitive Settings

About

Predicting where a person is looking is a complex task. It requires an understanding not only of the person's gaze and the scene content, but also of the 3D scene structure and the person's situation (are they manipulating an object, interacting with or observing others, attentive?), in order to detect obstructions in the line of sight or to apply the attention priors that humans typically rely on when observing others. In this paper, we hypothesize that identifying and leveraging such priors can be better achieved by exploiting explicitly derived multimodal cues such as depth and pose. We therefore propose a modular multimodal architecture that combines these cues using an attention mechanism. The architecture can naturally be applied in privacy-sensitive situations, such as surveillance and healthcare, where personally identifiable information cannot be released. We perform extensive experiments on the public GazeFollow and VideoAttentionTarget datasets, obtaining state-of-the-art performance and demonstrating very competitive results in the privacy-sensitive setting.
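The abstract describes fusing explicitly derived modality cues (e.g. image, depth, pose) with an attention mechanism. As a minimal illustrative sketch only, and not the authors' implementation, attention-weighted fusion of per-modality embeddings could look like the following (the function names and shapes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_modalities(features, query):
    """Attention-weighted fusion of modality embeddings (illustrative).

    features : (M, D) array, one D-dim embedding per modality
               (e.g. rows for image, depth, and pose branches).
    query    : (D,) person-specific query vector.
    Returns the fused (D,) embedding and the (M,) attention weights.
    """
    # Scaled dot-product scores, one per modality.
    scores = features @ query / np.sqrt(features.shape[1])
    weights = softmax(scores)
    # Convex combination of the modality embeddings.
    return weights @ features, weights

# Toy usage: three modalities, 4-dim embeddings.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))
q = rng.standard_normal(4)
fused, w = fuse_modalities(feats, q)
```

Because the weights form a convex combination, a modality branch (such as the RGB image in a privacy-sensitive deployment) can be dropped by simply removing its row, which is one way a modular design of this kind stays usable when some inputs cannot be released.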

Anshul Gupta, Samy Tafasca, Jean-Marc Odobez • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Gaze Following | GazeFollow (test) | AUC | 0.943 | 24 |
| Gaze Following | VideoAttentionTarget (test) | AUC | 0.913 | 20 |
| Gaze target estimation | GazeFollow | AUC | 0.943 | 18 |
| Gaze target estimation | VideoAttentionTarget | L2 Distance | 0.11 | 15 |
| Gaze Following | VAT (test) | Distance Error | 0.134 | 11 |
| Gaze following in video | VAT (test) | Distance Error | 0.134 | 11 |
| Gaze Following | ChildPlay | Distance | 0.113 | 10 |
| Social Gaze Prediction | VideoCoAtt | F1 (LAH) | 81.5 | 7 |
| Social Gaze Prediction | UCO-LAEO | F1 Score (LAH) | 98.9 | 7 |
| Social Gaze Prediction | VSGaze | F1 (LAH) | 78.2 | 7 |

Showing 10 of 19 rows
