EgoEnv: Human-centric environment representations from egocentric video

About

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge. Project page: https://vision.cs.utexas.edu/projects/ego-env/
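
To make the learning setup concrete, below is a minimal PyTorch sketch of one way an environment-prediction objective of this kind could be set up: pool per-frame clip features and supervise them against the (fully observable) surroundings in simulation. Everything here is an illustrative assumption rather than the paper's exact method, including the module name EnvPredictor, the transformer pooling, the feature dimensions, and the choice of predicting room categories at a fixed number of surrounding positions.

```python
import torch
import torch.nn as nn

class EnvPredictor(nn.Module):
    """Pools per-frame clip features and predicts labels for the
    camera-wearer's local surroundings (here: room categories at a few
    nearby positions). Hypothetical architecture, not the paper's."""

    def __init__(self, feat_dim=512, n_positions=4, n_room_classes=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head per surrounding position (illustrative).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_room_classes) for _ in range(n_positions)]
        )

    def forward(self, frame_feats):                     # (B, T, feat_dim)
        pooled = self.encoder(frame_feats).mean(dim=1)  # (B, feat_dim)
        return torch.stack([h(pooled) for h in self.heads], dim=1)  # (B, P, C)

# Training step: supervision comes from the simulator, where surrounding
# room labels are known even when they never appear on screen.
model = EnvPredictor()
feats = torch.randn(8, 16, 512)           # clip features from a simulated walkthrough
labels = torch.randint(0, 12, (8, 4))     # room class at each nearby position
logits = model(feats)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
loss.backward()
```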

Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman • 2022

Related benchmarks

Task                       Dataset                  Metric           Result   Rank
Natural Language Queries   MP3D (val)               R@1 (IoU=0.3)    38.18    8
Natural Language Queries   HouseTours (val)         R@1 (IoU=0.3)    51.98    8
Room Prediction            MP3D (val)               Accuracy         50.4     8
Room Prediction            HouseTours (val)         Accuracy         62.68    8
Natural Language Queries   Ego4D (val)              R@1 (IoU=0.3)    6.04     7
Room Prediction            Ego4D                    Accuracy         51.07    7
Natural Language Queries   Ego4D NLQ v2 (val)       R@1 (IoU=0.3)    25.37    7
Natural Language Queries   Ego4D NLQ v2 (test)      R@1 (IoU=0.3)    23.28    7
Natural Language Queries   Ego4D NLQ (challenge)    R@1 (IoU=0.3)    23.28    5
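The NLQ results above report recall at rank 1 under a temporal IoU threshold of 0.3: a query counts as correct if the model's top-ranked temporal window overlaps the ground-truth window with IoU of at least 0.3. A minimal sketch of that standard metric (the function names are illustrative, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal windows given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, thresh=0.3):
    """Fraction of queries whose top-ranked window overlaps the ground
    truth with temporal IoU >= thresh (the 'R@1 (IoU=0.3)' column above)."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Example: one hit (IoU 0.6) and one miss (no overlap) -> R@1 = 50.0
print(recall_at_1([(2.0, 6.0), (10.0, 12.0)], [(3.0, 7.0), (20.0, 24.0)]))
```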

Other info

Code
