World Simulation with Video Foundation Models for Physical AI
About
We introduce Cosmos-Predict2.5, the latest generation of the Cosmos World Foundation Models for Physical AI. Built on a flow-based architecture, Cosmos-Predict2.5 unifies Text2World, Image2World, and Video2World generation in a single model and leverages Cosmos-Reason1, a Physical AI vision-language model, to provide richer text grounding and finer control of world simulation. Trained on 200M curated video clips and refined with reinforcement learning-based post-training, Cosmos-Predict2.5 achieves substantial improvements over Cosmos-Predict1 in video quality and instruction alignment, with models released at 2B and 14B scales. These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems. We further extend the family with Cosmos-Transfer2.5, a ControlNet-style framework for Sim2Real and Real2Real world translation. Despite being 3.5$\times$ smaller than Cosmos-Transfer1, it delivers higher fidelity and robust long-horizon video generation. Together, these advances establish Cosmos-Predict2.5 and Cosmos-Transfer2.5 as versatile tools for scaling embodied intelligence. To accelerate research and deployment in Physical AI, we release source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License at https://github.com/nvidia-cosmos/cosmos-predict2.5 and https://github.com/nvidia-cosmos/cosmos-transfer2.5. We hope these open resources lower the barrier to adoption and foster innovation in building the next generation of embodied intelligence.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc 99 | 169 |
| Sim-to-Real Video Translation | nuPlan and CARLA | CLIP-R119.7 | 11 |
| Robotics Image-to-Video Generation | PAI-Bench-G | Grasp Success 84.4 | 8 |
| Object Navigation | 3D Navigation Evaluation Suite | Visual Consistency 73 | 5 |
| Scene Reasoning | 3D Navigation Evaluation Suite | Visual Consistency 60 | 5 |
| Spatial Grounding | 3D Navigation Evaluation Suite | Visual Consistency 60 | 5 |
| Language Control | 3D Navigation Evaluation Suite | Visual Consistency 53 | 5 |
| Precise Navigation | 3D Navigation Evaluation Suite | Visual Consistency 20 | 5 |
| Dynamics Modeling | Real-world tasks Experiment #1 | PSNR 21.17 | 4 |
| Dynamics Modeling | Bridge dataset (Experiment #2) | PSNR 21.32 | 4 |