
Whole-Body Conditioned Egocentric Video Prediction

About

We train models to Predict Ego-centric Video from human Actions (PEVA), given the past video and an action represented by the relative 3D body pose. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, our model learns to simulate how physical human actions shape the environment from a first-person point of view. We train an auto-regressive conditional diffusion transformer on Nymeria, a large-scale dataset of real-world egocentric video and body pose capture. We further design a hierarchical evaluation protocol with increasingly challenging tasks, enabling a comprehensive analysis of the model's embodied prediction and control abilities. Our work represents an initial attempt to tackle the challenges of modeling complex real-world environments and embodied agent behaviors with video prediction from the perspective of a human.
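The autoregressive, action-conditioned sampling loop described above can be sketched as follows. This is a toy illustration only: the frame shapes, joint count, context length, and the `denoise_step` stub are all hypothetical placeholders, not PEVA's actual architecture. The point is the control flow: each new frame is sampled by reverse diffusion conditioned on past frames and a relative 3D body-pose action, then fed back into the context.

```python
import numpy as np

FRAME_SHAPE = (8, 8, 3)   # toy frame resolution (assumption)
NUM_JOINTS = 5            # toy body-joint count (assumption)
CONTEXT = 3               # number of past frames conditioned on (assumption)

def denoise_step(noisy_frame, past_frames, pose_action, t):
    """Stub for one diffusion denoising step, conditioned on the past
    frames and the relative 3D body-pose action. A real model would be
    a learned conditional diffusion transformer; here we just blend the
    noisy frame toward the last context frame plus a pose-dependent drift."""
    drift = pose_action.mean() * 0.01
    return 0.9 * noisy_frame + 0.1 * (past_frames[-1] + drift)

def sample_frame(past_frames, pose_action, steps=4, seed=0):
    """Sample the next frame by running a short reverse-diffusion loop,
    starting from Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=FRAME_SHAPE)
    for t in reversed(range(steps)):
        x = denoise_step(x, past_frames, pose_action, t)
    return x

def rollout(init_frames, pose_actions):
    """Autoregressive rollout: each predicted frame re-enters the context
    window used to condition the next prediction."""
    frames = list(init_frames)
    for action in pose_actions:
        nxt = sample_frame(frames[-CONTEXT:], action)
        frames.append(nxt)
    return frames[len(init_frames):]

# Toy usage: 3 context frames, 4 relative-pose actions -> 4 predicted frames.
context = [np.zeros(FRAME_SHAPE) for _ in range(CONTEXT)]
actions = [np.ones((NUM_JOINTS, 3)) * k for k in range(4)]
preds = rollout(context, actions)
```

The key design choice mirrored here is that the action is a structured pose trajectory (one 3D offset per joint) rather than a flat command, so the conditioning signal reflects the body's joint hierarchy.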

Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Open-loop trajectory prediction | EgoDex (test) | Embedding L2 Error (at 4 s): 0.62 | 3 |
