DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation
About
Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view or video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is to disentangle global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth maps and pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
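The fusion step described above can be illustrated with a minimal NumPy sketch of a cross-attention adapter, in which high-resolution tokens act as queries and low-resolution tokens supply keys and values. This is not the authors' implementation: the function name `cross_attention_adapter`, the single-head layout, the token counts, and the scalar `gate` on the residual branch are all assumptions made for illustration. The zero-valued gate is one common way to leave a pretrained pathway unchanged at initialization; the abstract does not state which mechanism DAGE uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_adapter(hi_tokens, lo_tokens, w_q, w_k, w_v, w_o, gate):
    """Hypothetical single-head adapter (illustrative only).

    hi_tokens: (N_hi, D) tokens from the per-frame high-resolution stream.
    lo_tokens: (N_lo, D) tokens from the low-resolution, view-consistent stream.
    gate:      scalar weight on the injected context (assumed, not from the source).
    """
    q = hi_tokens @ w_q                        # queries come from the high-res stream
    k = lo_tokens @ w_k                        # keys come from the low-res stream
    v = lo_tokens @ w_v                        # values come from the low-res stream
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product scores, (N_hi, N_lo)
    attn = softmax(scores, axis=-1)            # each high-res token attends over low-res tokens
    context = (attn @ v) @ w_o                 # global context projected back to D dims
    return hi_tokens + gate * context          # residual: gate = 0 leaves hi_tokens untouched

rng = np.random.default_rng(0)
D = 16
hi = rng.standard_normal((64, D))   # e.g. 64 high-res tokens for one frame
lo = rng.standard_normal((8, D))    # e.g. 8 low-res tokens carrying global context
w_q, w_k, w_v, w_o = (rng.standard_normal((D, D)) * 0.1 for _ in range(4))

out_init = cross_attention_adapter(hi, lo, w_q, w_k, w_v, w_o, gate=0.0)
out_fused = cross_attention_adapter(hi, lo, w_q, w_k, w_v, w_o, gate=0.5)
```

With `gate=0.0` the output equals the high-resolution input exactly, so the single-frame pathway is undisturbed. A nonzero gate mixes in context from the low-resolution stream. Because the number of low-resolution tokens is independent of the number of high-resolution tokens, the attention cost grows with their product rather than with the square of the high-resolution token count.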
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Depth Estimation | Sintel | -- | 235 |
| Camera pose estimation | Sintel | ATE 0.132 | 203 |
| Monocular Depth Estimation | NYU V2 | -- | 174 |
| Monocular Depth Estimation | ETH3D | AbsRel 3.49 | 159 |
| Monocular Depth Estimation | DIODE | AbsRel 4.97 | 147 |
| Camera pose estimation | ScanNet | RPE (t) 0.014 | 133 |
| 3D Reconstruction | 7 Scenes | -- | 128 |
| Monocular Depth Estimation | Sintel | Abs Rel 18.9 | 127 |
| Camera pose estimation | TUM dynamics | ATE 0.014 | 90 |
| Depth Estimation | DIODE | -- | 82 |
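
For readers unfamiliar with the AbsRel metric in the table, the sketch below computes the standard absolute relative error, the mean of |predicted − ground truth| / ground truth over valid pixels. The function name `abs_rel`, the flat-list input, and the "positive depth means valid" rule are illustrative assumptions. The sketch also omits any scale or shift alignment and any percentage scaling that a given benchmark may apply before reporting.

```python
def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with positive ground truth."""
    # Keep only pixels whose ground-truth depth is valid (assumed here: > 0).
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# Toy example: the last pixel has no ground truth and is ignored.
pred = [1.0, 2.2, 2.7, 5.0]
gt = [1.0, 2.0, 3.0, 0.0]
score = abs_rel(pred, gt)  # (0 + 0.1 + 0.1) / 3, roughly 0.0667
```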