Aether: Geometric-Aware Unified World Modeling
About
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, thanks to its intrinsic geometric modeling, our approach achieves zero-shot generalization in both action-following and reconstruction tasks. Notably, even without real-world data, its reconstruction performance is comparable to or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically reasonable world modeling and its applications.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 60.4 | 193 |
| Camera pose estimation | Sintel | ATE | 0.189 | 192 |
| Camera pose estimation | TUM-dynamic | ATE | 0.092 | 163 |
| Video Depth Estimation | KITTI | Abs Rel | 0.054 | 126 |
| Camera pose estimation | ScanNet | RPE (t) | 0.028 | 119 |
| Video Depth Estimation | BONN | AbsRel | 27.3 | 116 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.273 | 103 |
| Camera pose estimation | TUM dynamics | ATE | 0.092 | 81 |
| Depth Estimation | Sintel ~50 frames | AbsRel | 0.324 | 47 |
| Depth Estimation | KITTI 110 frames | AbsRel | 5.6 | 46 |
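For readers unfamiliar with the depth metrics in the table, the sketch below shows the standard definitions of absolute relative error (AbsRel) and delta-threshold accuracy at 1.25. It is an illustration only: the function name `depth_metrics` and the toy depth values are hypothetical, and the per-benchmark evaluation protocol (scale alignment, depth caps, frame selection) is not specified here and is not reproduced. It also shows why the same AbsRel can be reported as a fraction or as a percentage, which appears to be the difference between the two BONN rows (0.273 and 27.3).

```python
from typing import Sequence, Tuple


def depth_metrics(pred: Sequence[float], gt: Sequence[float],
                  threshold: float = 1.25) -> Tuple[float, float]:
    """Return (AbsRel, delta accuracy) over pixels with valid depth.

    Simplification (assumption): pixels with non-positive ground truth or
    non-positive prediction are skipped; real benchmarks define their own masks.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid depth pixels")
    # AbsRel: mean of |pred - gt| / gt over valid pixels.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    # Delta accuracy: fraction of pixels with max(pred/gt, gt/pred) < threshold.
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / len(pairs)
    return abs_rel, delta


# Toy example with four pixels (hypothetical values, not benchmark data).
abs_rel, delta = depth_metrics([1.0, 2.2, 3.0, 4.4], [1.0, 2.0, 4.0, 4.0])
print(f"AbsRel = {abs_rel:.4f} (as percent: {100 * abs_rel:.2f}), "
      f"delta<1.25 = {delta:.2f}")
```

Lower is better for AbsRel and higher is better for delta accuracy, so values should only be compared within the same leaderboard and reporting convention.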