Aether: Geometric-Aware Unified World Modeling

About

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He • 2025

Related benchmarks

Task                    Dataset            Metric                           Result  Rank
Video Depth Estimation  Sintel             Delta Threshold Accuracy (1.25)  60.4    235
Camera Pose Estimation  TUM-dynamic        ATE                              0.092   205
Camera Pose Estimation  Sintel             ATE                              0.189   203
Video Depth Estimation  KITTI              Abs Rel                          0.054   148
Camera Pose Estimation  ScanNet            RPE (t)                          0.028   133
Video Depth Estimation  BONN               AbsRel                           27.3    131
Video Depth Estimation  BONN               Relative Error (Rel)             0.273   108
Camera Pose Estimation  TUM dynamics       ATE                              0.092   90
Depth Estimation        Sintel ~50 frames  AbsRel                           0.324   70
Depth Estimation        KITTI 110 frames   AbsRel                           5.6     69
Showing 10 of 52 rows
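
The depth rows above mix two scales: some AbsRel entries look like fractions (0.273, 0.054) and others like percentages (27.3, 5.6). As a reading aid, here is a minimal sketch of the two depth metrics as they are commonly defined. The function names and the toy values are hypothetical, and the exact evaluation protocol behind the listed numbers (scale alignment, valid-pixel masks) is not stated on this page.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid ground truth (gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # A non-positive prediction has no meaningful ratio, so it counts as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Toy depth values, purely illustrative; the last pixel has no valid ground truth.
gt = [1.0, 2.0, 4.0, 0.0]
pred = [1.1, 1.5, 4.0, 3.0]

rel = abs_rel(pred, gt)
acc = delta_accuracy(pred, gt)
print(f"AbsRel as fraction: {rel:.3f}, as percent: {100 * rel:.1f}")
print(f"Delta < 1.25 accuracy: {100 * acc:.1f}%")
```

Under these definitions, lower AbsRel is better and higher delta accuracy is better, and multiplying the fractional value by 100 gives the percent form.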
