EgoEnv: Human-centric environment representations from egocentric video

About

First-person video highlights a camera-wearer's activities in the context of their persistent environment. However, current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space and capture only what is immediately visible. To facilitate human-centric environment understanding, we present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings. We train such models using videos from agents in simulated 3D environments where the environment is fully observable, and test them on human-captured real-world videos from unseen environments. On two human-centric video tasks, we show that models equipped with our environment-aware features consistently outperform their counterparts with traditional clip features. Moreover, despite being trained exclusively on simulated videos, our approach successfully handles real-world videos from HouseTours and Ego4D, and achieves state-of-the-art results on the Ego4D NLQ challenge. Project page: https://vision.cs.utexas.edu/projects/ego-env/
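
To make the learning setup concrete, below is a minimal PyTorch sketch of one way an environment-prediction objective of this kind could be set up: pool per-frame clip features and supervise them against the (fully observable) surroundings in simulation. Everything here is an illustrative assumption rather than the paper's exact method, including the module name EnvPredictor, the transformer pooling, the feature dimensions, and the choice of predicting room categories at a fixed number of surrounding positions.

```python
import torch
import torch.nn as nn

class EnvPredictor(nn.Module):
    """Pools per-frame clip features and predicts labels for the
    camera-wearer's local surroundings (here: room categories at a few
    nearby positions). Hypothetical architecture, not the paper's."""

    def __init__(self, feat_dim=512, n_positions=4, n_room_classes=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head per surrounding position (illustrative).
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, n_room_classes) for _ in range(n_positions)]
        )

    def forward(self, frame_feats):                     # (B, T, feat_dim)
        pooled = self.encoder(frame_feats).mean(dim=1)  # (B, feat_dim)
        return torch.stack([h(pooled) for h in self.heads], dim=1)  # (B, P, C)

# Training step: supervision comes from the simulator, where surrounding
# room labels are known even when they never appear on screen.
model = EnvPredictor()
feats = torch.randn(8, 16, 512)           # clip features from a simulated walkthrough
labels = torch.randint(0, 12, (8, 4))     # room class at each nearby position
logits = model(feats)
loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
loss.backward()
```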

Tushar Nagarajan, Santhosh Kumar Ramakrishnan, Ruta Desai, James Hillis, Kristen Grauman • 2022

Related benchmarks

Task                       Dataset                  Metric           Result   Rank
Natural Language Queries   MP3D (val)               R@1 (IoU=0.3)    38.18    8
Natural Language Queries   HouseTours (val)         R@1 (IoU=0.3)    51.98    8
Room Prediction            MP3D (val)               Accuracy         50.4     8
Room Prediction            HouseTours (val)         Accuracy         62.68    8
Natural Language Queries   Ego4D (val)              R@1 (IoU=0.3)    6.04     7
Room Prediction            Ego4D                    Accuracy         51.07    7
Natural Language Queries   Ego4D NLQ v2 (val)       R@1 (IoU=0.3)    25.37    7
Natural Language Queries   Ego4D NLQ v2 (test)      R@1 (IoU=0.3)    23.28    7
Natural Language Queries   Ego4D NLQ (challenge)    R@1 (IoU=0.3)    23.28    5
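The NLQ results above report recall at rank 1 under a temporal IoU threshold of 0.3: a query counts as correct if the model's top-ranked temporal window overlaps the ground-truth window with IoU of at least 0.3. A minimal sketch of that standard metric (the function names are illustrative, not from the paper's codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two temporal windows given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, thresh=0.3):
    """Fraction of queries whose top-ranked window overlaps the ground
    truth with temporal IoU >= thresh (the 'R@1 (IoU=0.3)' column above)."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Example: one hit (IoU 0.6) and one miss (no overlap) -> R@1 = 50.0
print(recall_at_1([(2.0, 6.0), (10.0, 12.0)], [(3.0, 7.0), (20.0, 24.0)]))
```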

Other info

Code
