# DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

## About
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view or video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame-wise and global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per frame to preserve sharp boundaries and small structures. A lightweight adapter fuses the two streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth and pointmap predictions, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
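To make the fusion concrete, here is a minimal PyTorch sketch of the dual-stream pattern described above: a low-resolution stream with alternating within-frame and cross-frame (global) attention, and a cross-attention adapter that lets high-resolution per-frame tokens read the global context through a residual connection. All module names, token counts, and dimensions are hypothetical illustrations, not the actual DAGE implementation.

```python
import torch
import torch.nn as nn


class DualStreamFusion(nn.Module):
    """Sketch of DAGE-style dual-stream fusion (names/sizes are illustrative)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Low-res stream: alternating frame attention (within each frame)
        # and global attention (across the whole clip).
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Lightweight adapter: cross-attention injecting global context
        # into the high-res stream.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hi_tokens: torch.Tensor, lo_tokens: torch.Tensor) -> torch.Tensor:
        # hi_tokens: (frames, n_hi, dim) per-frame high-resolution tokens
        # lo_tokens: (frames, n_lo, dim) per-frame low-resolution tokens
        frames, n_lo, dim = lo_tokens.shape

        # Frame attention: tokens attend only within their own frame.
        x, _ = self.frame_attn(lo_tokens, lo_tokens, lo_tokens)

        # Global attention: flatten the frame axis so every token
        # attends across the entire clip, building view consistency.
        g = x.reshape(1, frames * n_lo, dim)
        g, _ = self.global_attn(g, g, g)
        g = g.reshape(frames, n_lo, dim)

        # Adapter: high-res queries read the globally-contextualized
        # low-res tokens; the residual keeps the pretrained
        # single-frame pathway undisturbed when the adapter outputs zero.
        fused, _ = self.cross_attn(hi_tokens, g, g)
        return hi_tokens + fused
```

Because the two streams only meet through the adapter, clip length (the global-attention cost) and image resolution (the high-res token count) can be scaled independently, which is the scaling property claimed above.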
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Depth Estimation | Sintel | -- | 193 |
| Camera Pose Estimation | Sintel | ATE 0.132 | 192 |
| Monocular Depth Estimation | ETH3D | AbsRel 3.49 | 132 |
| Monocular Depth Estimation | NYU V2 | -- | 131 |
| Camera Pose Estimation | ScanNet | RPE (t) 0.014 | 119 |
| Monocular Depth Estimation | DIODE | AbsRel 4.97 | 113 |
| 3D Reconstruction | 7 Scenes | -- | 94 |
| Monocular Depth Estimation | Sintel | AbsRel 18.9 | 91 |
| Camera Pose Estimation | TUM dynamics | ATE 0.014 | 81 |
| Depth Estimation | DIODE | Relative Error (REL) 8.7 | 63 |