Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control

About

We introduce Cosmos-Transfer, a conditional world generation model that can generate world simulations based on multiple spatial control inputs of various modalities such as segmentation, depth, and edge. In the design, the spatial conditional scheme is adaptive and customizable. It allows weighting different conditional inputs differently at different spatial locations. This enables highly controllable world generation and finds use in various world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations to analyze the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy to achieve real-time world generation with an NVIDIA GB200 NVL72 rack. To help accelerate research development in the field, we open-source our models and code at https://github.com/nvidia-cosmos/cosmos-transfer1.
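The spatially adaptive weighting described above can be illustrated with a small NumPy sketch. This is not the released Cosmos-Transfer1 implementation. The function name `blend_controls`, the dictionary-based interface, and the rule of rescaling weights only where they sum above 1 are all assumptions made for illustration. It only shows how per-location weight maps could let one modality (say, depth) dominate in one region while another (say, edge) dominates elsewhere.

```python
import numpy as np

def blend_controls(features, weights, eps=1e-8):
    """Blend per-modality feature maps using per-pixel weight maps.

    Hypothetical helper, not part of the cosmos-transfer1 codebase.
    features: dict mapping modality name -> array of shape (H, W, C)
    weights:  dict mapping modality name -> non-negative array of shape (H, W)
    """
    names = sorted(features)
    f = np.stack([features[n] for n in names])   # (M, H, W, C)
    w = np.stack([weights[n] for n in names])    # (M, H, W)

    # Assumption: where the weights at a pixel sum above 1, rescale them
    # so they sum to 1; elsewhere leave them untouched.
    total = w.sum(axis=0, keepdims=True)         # (1, H, W)
    scale = np.where(total > 1.0, 1.0 / np.maximum(total, eps), 1.0)
    w = w * scale

    # Weighted sum over modalities at every spatial location.
    return (w[..., None] * f).sum(axis=0)        # (H, W, C)


# Toy usage: depth controls the left column, edge controls the right column.
H, W, C = 2, 2, 1
feats = {"depth": np.ones((H, W, C)), "edge": 3.0 * np.ones((H, W, C))}
wts = {
    "depth": np.array([[1.0, 0.0], [1.0, 0.0]]),
    "edge": np.array([[0.0, 1.0], [0.0, 1.0]]),
}
blended = blend_controls(feats, wts)
```

In this toy setup each output pixel simply takes the feature of whichever modality has weight 1 at that location. Intermediate weights would mix the two.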

NVIDIA: Hassan Abu Alhaija, Jose Alvarez, Maciej Bala, Tiffany Cai, Tianshi Cao, Liz Cha, Joshua Chen, Mike Chen, Francesco Ferroni, Sanja Fidler, Dieter Fox, Yunhao Ge, Jinwei Gu, Ali Hassani, Michael Isaev, Pooya Jannaty, Shiyi Lan, Tobias Lasser, Huan Ling, Ming-Yu Liu, Xian Liu, Yifan Lu, Alice Luo, Qianli Ma, Hanzi Mao, Fabio Ramos, Xuanchi Ren, Tianchang Shen, Xinglong Sun, Shitao Tang, Ting-Chun Wang, Jay Wu, Jiashu Xu, Stella Xu, Kevin Xie, Yuchong Ye, Xiaodong Yang, Xiaohui Zeng, Yu Zeng • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Clean Desk | Real-world robot tasks | Score: 4.6 | 10 |
| Throw Bottle | Real-world robot tasks | Score: 3.2 | 10 |
| Fold clothes | Real-world robot tasks | Score: 4.1 | 10 |
| Video-to-Video Generation | DROID (test) | VBench: 0.81 | 6 |
| Video-to-Video Generation | AgiBot (test) | VBench: 79.9 | 6 |
| Video Generation | ReWorldBench | FVD: 281 | 5 |
| Trajectory-conditioned video generation | Bridge V2 (test) | -- | 5 |
| Object Detection | Playing for Benchmark (PFB) (val) | mAP@50: 14 | 4 |
| Photorealism Enhancement | Playing for Benchmark (PFB) (val) | KIDx100: 8.39 | 4 |
| Video Generation Quality | Throw Bottle | Pixel Matching: 1.60e+3 | 3 |
Showing 10 of 13 rows
