Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
About
We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.
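
The abstract describes GE-Act as decoding actions through flow matching. As a rough illustration of the general idea only, the sketch below integrates a velocity field from noise toward an action chunk with Euler steps. It is not the authors' implementation: `oracle_velocity`, `euler_decode`, the toy `target` array, and the step count are all hypothetical, and the closed-form oracle stands in for what would in practice be a learned network conditioned on latent features.

```python
import numpy as np

def oracle_velocity(x, t, target):
    # Hypothetical stand-in for a learned velocity network.
    # On the straight-line path x_t = (1 - t) * x0 + t * target,
    # the velocity that carries x_t to the target is (target - x_t) / (1 - t).
    return (target - x) / (1.0 - t)

def euler_decode(velocity_fn, x0, num_steps=10):
    # Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (actions)
    # using fixed-size Euler steps.
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        x = x + dt * velocity_fn(x, t)
    return x

# Toy example: a 2-step chunk of 3-dimensional actions (made-up numbers).
rng = np.random.default_rng(0)
target = np.array([[0.1, -0.2, 0.3], [0.2, -0.1, 0.25]])
noise = rng.standard_normal(target.shape)
actions = euler_decode(lambda x, t: oracle_velocity(x, t, target), noise)
```

With the oracle velocity the Euler path is exactly linear, so `actions` matches `target` up to floating-point error. A trained model would only approximate this field, and the number of integration steps then trades decoding speed against accuracy.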
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 95.8 | 700 |
| Video Generation | WorldArena | Interaction Quality | 19.8 | 9 |
| Real-robot Manipulation | Real-robot Seen tasks | Stack Bowl Success Rate | 50 | 6 |
| Real-robot Manipulation | Real-robot Novel tasks OOD | Place Block Success Rate | 5 | 5 |
| Cucumber Peeling | Real-world visuo-tactile dataset | Success Rate | 0 | 4 |
| Whiteboard Wiping | Real-world visuo-tactile dataset | Success Rate | 2.5 | 4 |
| Dynamics Modeling | Bridge dataset (Experiment #2) | PSNR | 21.47 | 4 |
| Dynamics Modeling | Real-world tasks Experiment #1 | PSNR | 21.16 | 4 |
| Potato Chip Pick-and-Place | Real-world visuo-tactile dataset | Success Rate | 0 | 4 |
| Action-Conditioned Generation | action-to-video dataset | PSNR | 18.05 | 3 |
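
Several rows above report PSNR (peak signal-to-noise ratio, in dB). For readers unfamiliar with the metric, here is a minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), for a single pair of frames. The function name `psnr` and the 8-bit peak value of 255 are assumptions; the listing does not state the exact evaluation protocol (resolution, color space, or how frames are averaged).

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    # Peak signal-to-noise ratio in dB between two same-shaped frames.
    # max_value is the largest possible pixel value (255 assumed for 8-bit images).
    ref = np.asarray(reference, dtype=np.float64)
    gen = np.asarray(generated, dtype=np.float64)
    if ref.shape != gen.shape:
        raise ValueError("frames must have the same shape")
    mse = np.mean((ref - gen) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy frames: a uniform per-pixel error of 25.5 on a 0-255 scale.
reference_frame = np.zeros((4, 4, 3))
generated_frame = np.full((4, 4, 3), 25.5)
score = psnr(reference_frame, generated_frame)
```

Higher values mean the generated frame is closer to the reference; a video-level score is commonly obtained by averaging over frames, though that choice is also an assumption here.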