EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses

About

Egocentric video generation with fine-grained control through body motion is a key requirement for embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.

Enrico Pallotta, Sina Mokhtarzadeh Azar, Lars Doorenbos, Serdar Ozsoy, Umar Iqbal, Juergen Gall • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Egocentric latent state prediction	HOMAGE	L2 Distance (2s)	0.099	7
Egocentric latent state prediction	LEMMA	L2 Error (2s)	0.091	7
Egocentric latent state prediction	Ego-Exo4D Bike	L2 Distance (2s)	0.085	7
Egocentric latent state prediction	Ego-Exo4D Cooking	L2 Error (2s)	0.09	7
Egocentric Video Generation	Nymeria (PEVA/EgoControl)	LPIPS	24.3	3
