UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
About
Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared to prior methods, and accomplishing a 300 m real-world mission with only two interventions.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Embodied AI Simulation | Embodied AI Simulators Comparison | Number of Assets | 1.02e+5 | 10 |
| Navigation | CraftBench (test) | Success Rate (SR) | 41.9 | 6 |
| Mapless Urban Navigation | Real-world Wheeled 1.0 (test) | Success Rate (SR) | 77.1 | 5 |
| Mapless Urban Navigation | Quadruped Real-world 1.0 (test) | Success Rate (SR) | 89.7 | 5 |
| Urban Embodied AI Simulation | Urban Embodied-AI Simulators | # Object Classes | 659 | 4 |