EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

About

Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, the inverse problem of using action signals to guide video synthesis remains underexplored, and the generated rollouts often fail to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
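As a rough illustration of what "projecting kinematic states into the target camera view" can mean, the sketch below projects 3D keypoints through a standard pinhole camera model and rasterizes them as a Gaussian heatmap. It is not the EA-WM implementation: the function names, the pinhole model, the single-channel heatmap, and all numeric values are assumptions chosen for illustration, and the abstract does not specify how the action fields are actually constructed.

```python
import math


def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation is a 3x3 world-to-camera matrix (list of rows) and translation
    is a 3-vector. Returns (u, v, depth), or None if the point lies behind
    the camera. All parameter names are hypothetical.
    """
    x, y, z = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0.0:
        return None
    return fx * x / z + cx, fy * y / z + cy, z


def rasterize_action_field(pixels, height, width, sigma=1.5):
    """Render projected keypoints as a single-channel Gaussian heatmap."""
    field = [[0.0] * width for _ in range(height)]
    for u, v in pixels:
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                value = math.exp(-d2 / (2.0 * sigma ** 2))
                field[row][col] = max(field[row][col], value)
    return field


# Assumed toy setup: identity rotation, world origin 1 m in front of the camera.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.0]
keypoints_world = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]  # e.g. two end-effector poses

projected = [
    project_point(p, R, t, fx=20.0, fy=20.0, cx=8.0, cy=8.0)
    for p in keypoints_world
]
pixels = [(u, v) for u, v, _ in (p for p in projected if p is not None)]
field = rasterize_action_field(pixels, height=16, width=16)
```

A field like this shares the spatial layout of the video frames, which is the property the abstract contrasts with low-dimensional action tokens. A real system would presumably draw full robot links or trajectories and could use multiple channels, but those details are not given here.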

Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen • 2026

Related benchmarks

Task	Dataset	Result
World Modeling	WorldArena (test)	Image Quality 36.4	15
Video Generation	WorldArena	Interaction Quality 68.2	14
Embodied World Modeling	WorldArena Robotwin	Interaction Quality Score 0.682	9
Video Perception	WorldArena	Img Score 0.364	5
