EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
About
Egocentric video generation with fine-grained control through body motion is a key requirement for embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.
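To make the idea of a pose representation that combines global camera dynamics with articulated body movement more concrete, here is a minimal sketch. It is not the representation used by EgoControl, whose details the abstract does not give. It assumes, purely for illustration, that the head position stands in for the camera, that camera dynamics are encoded as frame-to-frame head translation, and that body articulation is encoded as joint positions relative to the head. Rotations are omitted for brevity, and the function name `build_pose_features` is hypothetical.

```python
from typing import List, Sequence

Vec3 = Sequence[float]


def _sub(a: Vec3, b: Vec3) -> List[float]:
    """Component-wise difference of two 3D vectors."""
    return [a[i] - b[i] for i in range(3)]


def build_pose_features(
    head_positions: Sequence[Vec3],
    joint_positions: Sequence[Sequence[Vec3]],
) -> List[List[float]]:
    """Build one control vector per frame transition (hypothetical layout).

    head_positions:  T global head (camera proxy) positions.
    joint_positions: T lists of J global joint positions.
    Returns T-1 vectors of length 3 + 3*J:
      [head translation since previous frame, joints relative to head].
    """
    features = []
    for t in range(1, len(head_positions)):
        # Global camera dynamics: how the head moved since the last frame.
        head_delta = _sub(head_positions[t], head_positions[t - 1])
        # Articulated body: each joint expressed in a head-centred frame.
        local_joints = [
            coord
            for joint in joint_positions[t]
            for coord in _sub(joint, head_positions[t])
        ]
        features.append(head_delta + local_joints)
    return features


# Two frames, one joint: the head moves 1 unit along x,
# and the joint sits 1 unit along y from the head in the second frame.
heads = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
joints = [[[0.0, 0.0, 0.0]], [[1.0, 1.0, 0.0]]]
demo_features = build_pose_features(heads, joints)
```

Under these assumptions, a sequence of such vectors is the kind of per-frame signal a control mechanism could consume alongside the observed frames. The actual model may encode pose quite differently.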
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Egocentric latent state prediction | HOMAGE | L2 Distance (2s) | 0.099 | 7 |
| Egocentric latent state prediction | LEMMA | L2 Error (2s) | 0.091 | 7 |
| Egocentric latent state prediction | Ego-Exo4D Bike | L2 Distance (2s) | 0.085 | 7 |
| Egocentric latent state prediction | Ego-Exo4D Cooking | L2 Error (2s) | 0.09 | 7 |
| Egocentric Video Generation | Nymeria (PEVA/EgoControl) | LPIPS | 24.3 | 3 |