
Aether: Geometric-Aware Unified World Modeling

About

The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates zero-shot synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Notably, even without real-world data, its reconstruction performance is comparable with or even better than that of domain-specific models. Additionally, Aether employs camera trajectories as geometry-informed action spaces, enabling effective action-conditioned prediction and visual planning. We hope our work inspires the community to explore new frontiers in physically reasonable world modeling and its applications.

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Tong He • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 60.4 | 193 |
| Camera pose estimation | Sintel | ATE | 0.189 | 192 |
| Camera pose estimation | TUM-dynamic | ATE | 0.092 | 163 |
| Video Depth Estimation | KITTI | Abs Rel | 0.054 | 126 |
| Camera pose estimation | ScanNet | RPE (t) | 0.028 | 119 |
| Video Depth Estimation | BONN | AbsRel | 27.3 | 116 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.273 | 103 |
| Camera pose estimation | TUM dynamics | ATE | 0.092 | 81 |
| Depth Estimation | Sintel ~50 frames | AbsRel | 0.324 | 47 |
| Depth Estimation | KITTI 110 frames | AbsRel | 5.6 | 46 |
Showing 10 of 46 rows
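
The depth rows above report AbsRel and a threshold accuracy at 1.25, but this page does not define either metric. As a minimal sketch, assuming the conventional definitions from depth-estimation evaluation (AbsRel as the mean of |pred - gt| / gt, and threshold accuracy as the fraction of pixels with max(pred/gt, gt/pred) < 1.25), they could be computed as follows. The function names `abs_rel` and `delta_accuracy` are hypothetical, and the sketch omits any scale or shift alignment that a particular benchmark protocol may apply before scoring.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative error over pixels with valid ground truth."""
    # Assumption: ground-truth depths <= 0 mark invalid pixels and are skipped.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred: Sequence[float], gt: Sequence[float],
                   threshold: float = 1.25) -> float:
    """Fraction of valid pixels whose depth ratio is within the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    # A non-positive prediction has no meaningful ratio, so it counts as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


if __name__ == "__main__":
    predicted = [1.0, 2.2, 2.0, 8.0]
    ground_truth = [1.0, 2.0, 4.0, 8.0]
    print(f"AbsRel: {abs_rel(predicted, ground_truth):.3f}")
    print(f"delta < 1.25: {delta_accuracy(predicted, ground_truth):.2f}")
```

Under these definitions, lower AbsRel is better and higher threshold accuracy is better. Some rows in the table appear to report the same quantity on different scales (for example 27.3 versus 0.273 on BONN), so check each benchmark's convention before comparing values.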
