DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

About

Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view or video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames, alternating frame and global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per frame to preserve sharp boundaries and small structures. A lightweight adapter fuses the two streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth maps and pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
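
To illustrate why confining global attention to the low-resolution stream can keep cost practical, here is a rough back-of-the-envelope sketch in Python. It counts query-key pairs for a single attention layer and is not taken from the paper: the patch size, frame count, resolutions, and the helper names `tokens_per_frame` and `attention_pairs` are all assumptions chosen for illustration, and the cost of the cross-attention adapter is left out.

```python
# Illustrative arithmetic only. Patch size, resolutions, and frame count are
# assumed values, not figures reported for DAGE.

def tokens_per_frame(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for one frame (partial patches are dropped)."""
    return (height // patch) * (width // patch)


def attention_pairs(num_frames: int, tokens: int, scope: str) -> int:
    """Count query-key pairs for one attention layer.

    scope="frame":  each frame attends only to its own tokens.
    scope="global": every token attends to every token of every frame.
    """
    if scope == "frame":
        return num_frames * tokens * tokens
    if scope == "global":
        return (num_frames * tokens) ** 2
    raise ValueError(f"unknown scope: {scope!r}")


frames = 32                                  # hypothetical clip length
hi_tokens = tokens_per_frame(1152, 2048)     # hypothetical ~2K input
lo_tokens = tokens_per_frame(144, 256)       # hypothetical 8x downsampled copy

# Dual-stream style: global attention on low-res tokens,
# per-frame attention on high-res tokens.
dual = (attention_pairs(frames, lo_tokens, "global")
        + attention_pairs(frames, hi_tokens, "frame"))

# Single-stream baseline: global attention directly on high-res tokens.
single = attention_pairs(frames, hi_tokens, "global")

print(f"tokens per frame: high-res {hi_tokens}, low-res {lo_tokens}")
print(f"single-stream global pairs: {single:,}")
print(f"dual-stream pairs:          {dual:,}")
print(f"ratio: {single / dual:.1f}x")
```

Under these assumptions the high-resolution term grows linearly with clip length, while the quadratic global term depends only on the low-resolution token count. That is one way to read the claim that resolution and clip length scale independently.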

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee (2026)

Related benchmarks

Task                        Dataset       Result                     Rank
Video Depth Estimation      Sintel        --                         193
Camera pose estimation      Sintel        ATE 0.132                  192
Monocular Depth Estimation  ETH3D         AbsRel 3.49                132
Monocular Depth Estimation  NYU V2        --                         131
Camera pose estimation      ScanNet       RPE (t) 0.014              119
Monocular Depth Estimation  DIODE         AbsRel 4.97                113
3D Reconstruction           7 Scenes      --                         94
Monocular Depth Estimation  Sintel        Abs Rel 18.9               91
Camera pose estimation      TUM dynamics  ATE 0.014                  81
Depth Estimation            DIODE         Relative Error (REL) 8.7   63

Showing 10 of 44 rows
