Towards Consistent Video Geometry Estimation

About

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.

Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen (2026)
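
The abstract says dynamic chunking attention exposes the model to both bidirectional and causal temporal contexts, but it does not give the exact masking rule. The following is only a minimal sketch under an assumed interpretation: frames are grouped into chunks, attention is bidirectional inside a chunk and causal across chunks, and the chunk size is resampled during training. With a chunk size of 1 the mask becomes frame-causal (streaming), and with a chunk size equal to the clip length it becomes fully bidirectional (offline). The helpers `chunked_temporal_mask`, `expand_to_tokens` and `sample_chunk_size` are hypothetical and are not taken from ViGeo's code.

```python
import random
from typing import List, Optional


def chunked_temporal_mask(num_frames: int, chunk_size: int) -> List[List[bool]]:
    """Frame-level mask: mask[i][j] is True when frame i may attend to frame j.

    Assumed rule (hypothetical): frames are split into consecutive chunks of
    `chunk_size` frames. Attention is bidirectional within a chunk and causal
    across chunks, so a frame sees its own chunk and all earlier chunks but no
    later chunk. A shorter final chunk is allowed.
    """
    if num_frames < 1 or chunk_size < 1:
        raise ValueError("num_frames and chunk_size must be positive")
    chunk_of = [t // chunk_size for t in range(num_frames)]
    return [[chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
            for i in range(num_frames)]


def expand_to_tokens(frame_mask: List[List[bool]], tokens_per_frame: int) -> List[List[bool]]:
    """Broadcast a frame-level mask to tokens: all patch tokens of a frame share its row and column."""
    n = len(frame_mask) * tokens_per_frame
    return [[frame_mask[a // tokens_per_frame][b // tokens_per_frame] for b in range(n)]
            for a in range(n)]


def sample_chunk_size(num_frames: int, rng: Optional[random.Random] = None) -> int:
    """Hypothetical training-time sampler: a random chunk size in [1, num_frames]
    spans causal (1) through fully bidirectional (num_frames) contexts."""
    rng = rng or random.Random()
    return rng.randint(1, num_frames)


# Usage: the same weights, two attention patterns chosen at test time.
streaming = chunked_temporal_mask(4, chunk_size=1)  # lower-triangular, frame-causal
offline = chunked_temporal_mask(4, chunk_size=4)    # every entry True, bidirectional
```

Under this assumption, switching between streaming and full-sequence inference only changes the chunk size passed to the mask builder, which matches the abstract's claim that the attention pattern can be adapted at test time without retraining.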

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Monocular Depth Estimation | Sintel | Abs Rel | 0.24 | 127
Monocular Depth Estimation | KITTI | Abs Rel | 5.4 | 69
Surface Normal Estimation | NYU V2 | Mean Angular Error | 15.11 | 65
Monocular Depth Estimation | Bonn | δ < 1.25 Accuracy | 97.3 | 60
Video Surface Normal Estimation | Sintel | Mean Angular Error | 36.93 | 25
Video pointmap evaluation | KITTI | Relp | 0.05 | 24
Video Depth Estimation | Bonn (400 frames) | Abs Rel | 0.059 | 15
Video point map estimation | Sintel | -- | -- | 12
Scale-Invariant Video Depth Estimation | Sintel | Relative Error (Rel) | 0.229 | 11
Scale-Invariant Video Depth Estimation | Bonn | Relative Error (Rel) | 4.6 | 11

(Showing 10 of 15 rows.)
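
The table mixes several error and accuracy metrics. As a rough guide, here is a minimal sketch of their commonly used definitions: Abs Rel, δ < 1.25 accuracy, and mean angular error for normals. It also includes a least-squares scale fit of the kind often applied before scale-invariant evaluation. This is not the evaluation code behind the numbers above. Benchmark protocols differ: some align scale and shift, some align in disparity space, and some report values multiplied by 100. All function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _valid_pairs(pred: Sequence[float], gt: Sequence[float]) -> List[Tuple[float, float]]:
    """Keep pixels with positive ground-truth depth (assumes missing GT is stored as 0)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return pairs


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold.
    Non-positive predictions are counted as misses."""
    pairs = _valid_pairs(pred, gt)
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


def align_scale(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Closed-form least-squares scale s minimising sum((s * pred - gt)^2) over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    den = sum(p * p for p, _ in pairs)
    if den == 0:
        raise ValueError("prediction is zero on all valid pixels")
    return sum(p * g for p, g in pairs) / den


def mean_angular_error(pred_normals: Sequence[Vec3], gt_normals: Sequence[Vec3]) -> float:
    """Mean angle in degrees between predicted and ground-truth normals (both normalised first)."""
    if len(pred_normals) != len(gt_normals) or not pred_normals:
        raise ValueError("need equally sized, non-empty normal lists")
    total = 0.0
    for a, b in zip(pred_normals, gt_normals):
        cos = sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
        total += math.degrees(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error
    return total / len(pred_normals)


# Usage: align a prediction that is correct only up to scale, then score it.
pred, gt = [1.0, 2.0, 0.5], [2.0, 4.0, 0.0]  # third pixel has no ground truth
s = align_scale(pred, gt)                      # 2.0
aligned_err = abs_rel([s * p for p in pred], gt)  # 0.0
```

For Abs Rel and mean angular error lower is better, while δ < 1.25 accuracy is higher-is-better. A scale-invariant score such as the Rel columns above only becomes meaningful after an alignment step like `align_scale`.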

Other info

GitHub
