BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

About

Embodied world models have emerged as a promising paradigm in robotics, most of which leverage large-scale Internet videos or pretrained video generation models to enrich visual and motion priors. However, they still face key challenges: a misalignment between coordinate-space actions and pixel-space videos, sensitivity to camera viewpoint, and non-unified architectures across embodiments. To this end, we present BridgeV2W, which converts coordinate-space actions into pixel-aligned embodiment masks rendered from the URDF and camera parameters. These masks are then injected into a pretrained video generation model via a ControlNet-style pathway, which aligns the action control signals with predicted videos, adds view-specific conditioning to accommodate camera viewpoints, and yields a unified world model architecture across embodiments. To mitigate overfitting to static backgrounds, BridgeV2W further introduces a flow-based motion loss that focuses on learning dynamic and task-relevant regions. Experiments on single-arm (DROID) and dual-arm (AgiBot-G1) datasets, covering diverse and challenging conditions with unseen viewpoints and scenes, show that BridgeV2W improves video generation quality compared to prior state-of-the-art methods. We further demonstrate the potential of BridgeV2W on downstream real-world tasks, including policy evaluation and goal-conditioned planning. More results can be found on our project website at https://BridgeV2W.github.io.
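
The idea of turning coordinate-space robot geometry into a pixel-aligned mask can be illustrated with a minimal pinhole-camera sketch. This is not the authors' renderer: the abstract describes rendering from the URDF and camera parameters, whereas the sketch below assumes the robot surface has already been sampled into 3D points (for example via forward kinematics) and simply splats them into a binary image. The function names `world_to_camera`, `project_points`, and `rasterize_mask`, and the parameters `fx`, `fy`, `cx`, `cy`, are hypothetical.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def world_to_camera(points: Sequence[Point3D],
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> List[Point3D]:
    """Apply camera extrinsics: p_cam = R @ p_world + t (assumed convention)."""
    out = []
    for p in points:
        out.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return out


def project_points(points_cam: Sequence[Point3D],
                   fx: float, fy: float, cx: float, cy: float) -> List[Pixel]:
    """Project camera-frame points to pixels with a pinhole model."""
    pixels = []
    for x, y, z in points_cam:
        if z <= 0:  # skip points behind the camera
            continue
        pixels.append((fx * x / z + cx, fy * y / z + cy))
    return pixels


def rasterize_mask(pixels: Sequence[Pixel], width: int, height: int,
                   radius: int = 1) -> List[List[int]]:
    """Splat each projected point as a small square into a binary mask."""
    mask = [[0] * width for _ in range(height)]
    for u, v in pixels:
        ui, vi = int(round(u)), int(round(v))
        for dv in range(-radius, radius + 1):
            for du in range(-radius, radius + 1):
                px, py = ui + du, vi + dv
                if 0 <= px < width and 0 <= py < height:
                    mask[py][px] = 1
    return mask


# Usage: one point at the world origin, camera 1 m away along its optical axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_points = world_to_camera([(0.0, 0.0, 0.0)], identity, [0.0, 0.0, 1.0])
pixels = project_points(cam_points, fx=100.0, fy=100.0, cx=32.0, cy=24.0)
mask = rasterize_mask(pixels, width=64, height=48, radius=0)
```

Because the mask is computed from the camera intrinsics and extrinsics, changing the viewpoint changes the mask accordingly, which is the property the abstract relies on for view-specific conditioning. A mesh-based renderer would replace the point splatting in practice.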

Yixiang Chen, Peiyan Li, Jiabing Yang, Keji He, Xiangnan Wu, Yuan Xu, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Generation | DROID (In-Domain) | PSNR 22.89 | 4 |
| Video Generation | DROID Unseen Camera Viewpoint | PSNR 20.87 | 4 |
| Video Generation | DROID (Unseen Scene) | PSNR 19.73 | 4 |
| Video Generation | AgiBot-G1 | PSNR 24.49 | 4 |
| All Tasks | Real-world 1.0 (test) | Success Rate 45 | 2 |
| Flip Cup | Real-world 1.0 (test) | Success Rate 30 | 2 |
| Put on plate | Real-world 1.0 (test) | Success Rate 50 | 2 |
| Close Drawer | Real-world 1.0 (test) | Success Rate 50 | 2 |
| Put in shelf | Real-world 1.0 (test) | Success Rate 50 | 2 |
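
For readers unfamiliar with the PSNR metric reported for the video generation rows, the standard definition is PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared error between predicted and reference pixels and MAX is the peak pixel value. The sketch below assumes 8-bit pixel values (peak 255) stored as flat sequences. This page does not state how the benchmark aggregates PSNR over frames or videos, so treat this as an illustration of the formula only. The function name `psnr` is hypothetical.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Usage: every pixel off by one intensity level gives MSE = 1.
value = psnr([10, 20, 30], [11, 21, 31])
```

Higher PSNR means the predicted frames are closer to the ground-truth frames in a pixel-wise sense.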