DAGE: Dual-Stream Architecture for Efficient and Fine-Grained Geometry Estimation

About

Estimating accurate, view-consistent geometry and camera poses from uncalibrated multi-view/video inputs remains challenging, especially at high spatial resolutions and over long sequences. We present DAGE, a dual-stream transformer whose main novelty is disentangling global coherence from fine detail. A low-resolution stream operates on aggressively downsampled frames with alternating frame/global attention to build a view-consistent representation and estimate cameras efficiently, while a high-resolution stream processes the original images per-frame to preserve sharp boundaries and small structures. A lightweight adapter fuses these streams via cross-attention, injecting global context without disturbing the pretrained single-frame pathway. This design scales resolution and clip length independently, supports inputs up to 2K, and maintains practical inference cost. DAGE delivers sharp depth/pointmaps, strong cross-view consistency, and accurate poses, establishing new state-of-the-art results for video geometry estimation and multi-view reconstruction.
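The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the projection matrices `wq`/`wk`/`wv`, and the scalar `gate` are hypothetical, and the zero-initialized gate is only one common way to add context without changing a pretrained pathway at the start of training. High-resolution tokens act as queries, low-resolution tokens supply keys and values, and the attended context is added back as a gated residual.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def fuse(hr_tokens, lr_tokens, wq, wk, wv, gate):
    """Hypothetical adapter: HR tokens query LR tokens, gated residual add.

    With gate == 0.0 the output equals hr_tokens, so the high-resolution
    pathway is left untouched (an assumption made for this sketch).
    """
    q = matmul(hr_tokens, wq)
    k = matmul(lr_tokens, wk)
    v = matmul(lr_tokens, wv)
    context = cross_attention(q, k, v)
    return [[h + gate * c for h, c in zip(h_row, c_row)]
            for h_row, c_row in zip(hr_tokens, context)]


def random_matrix(rows, cols, rng):
    """Small Gaussian matrix used only to exercise the sketch."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    hr = random_matrix(6, dim, rng)   # many high-resolution tokens
    lr = random_matrix(3, dim, rng)   # few low-resolution tokens
    wq, wk, wv = (random_matrix(dim, dim, rng) for _ in range(3))
    fused = fuse(hr, lr, wq, wk, wv, gate=0.5)
    print(len(fused), len(fused[0]))
```

Because the number of low-resolution tokens is independent of the number of high-resolution tokens, the cost of this step grows with their product rather than with the square of the high-resolution token count, which is consistent with the claim that resolution and clip length can be scaled independently.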

Tuan Duc Ngo, Jiahui Huang, Seoung Wug Oh, Kevin Blackburn-Matzen, Evangelos Kalogerakis, Chuang Gan, Joon-Young Lee • 2026

Related benchmarks

Task	Dataset	Result
Video Depth Estimation	Sintel	--	235
Camera pose estimation	Sintel	ATE 0.132	203
Monocular Depth Estimation	NYU V2	--	174
Monocular Depth Estimation	ETH3D	AbsRel 3.49	159
Monocular Depth Estimation	DIODE	AbsRel 4.97	147
Camera pose estimation	ScanNet	RPE (t) 0.014	133
3D Reconstruction	7 Scenes	--	128
Monocular Depth Estimation	Sintel	AbsRel 18.9	127
Camera pose estimation	TUM dynamics	ATE 0.014	90
Depth Estimation	DIODE	--	82

Showing 10 of 44 rows
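For readers unfamiliar with the AbsRel metric in the table, the following is a minimal sketch of its standard definition: the mean of |predicted - ground truth| / ground truth over pixels with valid ground truth. The function name `abs_rel` is hypothetical. This listing does not state whether predictions are scale-aligned before evaluation or whether the reported values are percentages, so the sketch does neither.

```python
def abs_rel(pred, gt):
    """Absolute relative depth error over pixels with positive ground truth.

    pred and gt are flat sequences of depths; pixels where gt <= 0 are
    treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)
```

Lower values are better; a perfect prediction gives 0.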
