Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

About

We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren • 2025
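
The abstract describes GE-Act as decoding action trajectories with a flow-matching decoder. As a rough illustration of the general technique only (not the authors' implementation), the sketch below integrates a velocity field from a noise sample to an action vector with Euler steps. The names `oracle_velocity`, `integrate_flow`, and `target_action` are hypothetical; in a real system a trained network conditioned on latent features would stand in for the closed-form velocity used here.

```python
import random


def oracle_velocity(x, t, target):
    """Closed-form velocity of the straight-line path toward `target` (t < 1).

    Assumption for illustration: a learned, conditioned network would
    replace this function in an actual flow-matching decoder.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def integrate_flow(velocity_fn, x0, num_steps=10):
    """Euler-integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target_action = [0.3, -0.1, 0.8]  # hypothetical 3-D action
noise = [rng.gauss(0.0, 1.0) for _ in target_action]  # starting sample
decoded = integrate_flow(
    lambda x, t: oracle_velocity(x, t, target_action), noise
)
```

With the oracle velocity, the final Euler step lands on `target_action` up to floating-point error, which makes the integration loop easy to check. A learned velocity field would only approximate this behaviour.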

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement: 95.8 | 700 |
| Video Generation | WorldArena | Interaction Quality: 19.8 | 9 |
| Real-robot Manipulation | Real-robot Seen tasks | Stack Bowl Success Rate: 50 | 6 |
| Real-robot Manipulation | Real-robot Novel tasks OOD | Place Block Success Rate: 5 | 5 |
| Cucumber Peeling | Real-world visuo-tactile dataset | Success Rate: 0.00e+0 | 4 |
| Whiteboard Wiping | Real-world visuo-tactile dataset | Success Rate: 2.5 | 4 |
| Dynamics Modeling | Bridge dataset (Experiment #2) | PSNR: 21.47 | 4 |
| Dynamics Modeling | Real-world tasks Experiment #1 | PSNR: 21.16 | 4 |
| Potato Chip Pick-and-Place | Real-world visuo-tactile dataset | Success Rate: 0.00e+0 | 4 |
| Action-Conditioned Generation | action-to-video dataset | PSNR: 18.05 | 3 |
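
Several of the results above are reported as PSNR (peak signal-to-noise ratio). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, assuming 8-bit pixel values (peak of 255) and frames flattened to plain lists. The function name `psnr` and the sample values are illustrative, and the listed benchmarks may use different preprocessing or averaging.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened frames, each pixel off by 10 intensity levels.
reference_frame = [0, 64, 128, 255]
generated_frame = [10, 54, 138, 245]
score = psnr(reference_frame, generated_frame)
```

Higher values mean the generated frame is closer to the reference. Identical frames give an infinite score.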
