
DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

About

Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view or video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
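The fusion step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cross_attention_adapter`, the single-head layout, the weight shapes, and the scalar `gate` are all assumptions made for illustration. The gate is one common way to leave a pretrained pathway undisturbed (with `gate=0.0` the high-resolution tokens pass through unchanged); the abstract does not say which mechanism DAGE actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_adapter(hi_tokens, lo_tokens, w_q, w_k, w_v, w_o, gate):
    """Hypothetical single-head adapter (illustrative only).

    hi_tokens: (N_hi, D) per-frame tokens from the high-resolution stream.
    lo_tokens: (N_lo, D) view-consistent tokens from the low-resolution stream.
    gate:      scalar; 0.0 returns hi_tokens unchanged (assumed design).
    """
    q = hi_tokens @ w_q                      # queries come from the high-res stream
    k = lo_tokens @ w_k                      # keys come from the low-res stream
    v = lo_tokens @ w_v                      # values come from the low-res stream
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N_hi, N_lo)
    context = (attn @ v) @ w_o               # global context for each high-res token
    return hi_tokens + gate * context        # gated residual injection

rng = np.random.default_rng(0)
D = 16
hi = rng.standard_normal((64, D))            # many fine-detail tokens
lo = rng.standard_normal((8, D))             # few downsampled global tokens
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

out_closed = cross_attention_adapter(hi, lo, w_q, w_k, w_v, w_o, gate=0.0)
out_open = cross_attention_adapter(hi, lo, w_q, w_k, w_v, w_o, gate=1.0)
```

In this sketch the attention matrix has shape `(N_hi, N_lo)`, so its cost grows with the product of the two token counts rather than with the square of the high-resolution token count. That is consistent with the claim that resolution and clip length can be scaled independently, though the real model's cost profile depends on details not given here.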

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee • 2026

Related benchmarks

Task                         Dataset        Result           Rank
Video Depth Estimation       Sintel         --               235
Camera pose estimation       Sintel         ATE 0.132        203
Monocular Depth Estimation   NYU V2         --               174
Monocular Depth Estimation   ETH3D          AbsRel 3.49      159
Monocular Depth Estimation   DIODE          AbsRel 4.97      147
Camera pose estimation       ScanNet        RPE (t) 0.014    133
3D Reconstruction            7 Scenes       --               128
Monocular Depth Estimation   Sintel         AbsRel 18.9      127
Camera pose estimation       TUM dynamics   ATE 0.014        90
Depth Estimation             DIODE          --               82
Showing 10 of 44 rows
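
For readers unfamiliar with the AbsRel metric in the table, the sketch below computes the standard absolute relative error: the mean of |predicted - true| / true over pixels with valid ground truth. The function name `abs_rel` and the `eps` validity threshold are illustrative choices. The listing does not state units or the evaluation protocol; values such as 3.49 are presumably percentages, and depth benchmarks often align the prediction's scale to the ground truth before scoring, a step omitted here.

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative depth error over valid pixels (gt > eps)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > eps                          # skip pixels without ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Toy 2x2 depth maps; the zero in gt marks a pixel with no ground truth.
gt = np.array([[2.0, 4.0], [0.0, 5.0]])
pred = np.array([[2.2, 3.6], [1.0, 5.0]])
score = abs_rel(pred, gt)                     # fraction; multiply by 100 for percent
```

Here the three valid pixels have relative errors of 0.1, 0.1, and 0, so the score is their mean.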
