Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

About

Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
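The consistency loss mentioned above ties together three predictions: depth maps, camera parameters, and point maps. The page gives no formula, so the following is only a minimal sketch of one plausible form, assuming a pinhole camera, a camera-to-world pose, and an L1 penalty. The names `unproject` and `consistency_loss` are hypothetical and not taken from CARVE.

```python
def unproject(depth, K, R, t):
    """Lift a depth map (H x W nested list) to world-space 3D points.

    Assumptions (not from the source): K = (fx, fy, cx, cy) are pinhole
    intrinsics, and R (3x3), t (3,) map camera coordinates to world coordinates.
    """
    fx, fy, cx, cy = K
    points = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Back-project pixel (u, v) to a camera-space point at depth d.
            pc = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Rigid transform into the world frame: R @ pc + t.
            pw = tuple(
                sum(R[i][j] * pc[j] for j in range(3)) + t[i] for i in range(3)
            )
            out_row.append(pw)
        points.append(out_row)
    return points


def consistency_loss(depth, K, R, t, point_map):
    """Mean per-pixel L1 distance between the lifted depth and a point map."""
    lifted = unproject(depth, K, R, t)
    total, count = 0.0, 0
    for row_a, row_b in zip(lifted, point_map):
        for pa, pb in zip(row_a, row_b):
            total += sum(abs(a - b) for a, b in zip(pa, pb))
            count += 1
    return total / count


# Toy 2x2 example with an identity pose.
K = (1.0, 1.0, 0.5, 0.5)
R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t = (0.0, 0.0, 0.0)
depth = [[2.0, 2.0], [2.0, 2.0]]
point_map = unproject(depth, K, R, t)

loss_consistent = consistency_loss(depth, K, R, t, point_map)
loss_shifted = consistency_loss(depth, K, R, (0.0, 0.0, 1.0), point_map)
```

When the three predictions agree, the loss is zero. Perturbing any one of them, here the camera translation, makes it positive, which is the signal such a term would use to pull the outputs toward a single consistent geometry.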

Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| Monocular Depth Estimation | KITTI | Abs Rel: 0.106 | 220 |
| Video Depth Estimation | BONN | Relative Error (Rel): 0.041 | 108 |
| Monocular Depth Estimation | BONN | Delta 1.25 Accuracy: 98.5 | 60 |
| Camera pose estimation | TUM | ATE: 0.041 | 59 |
| Depth Estimation | HAMMER | -- | 29 |
| 2D Depth Estimation | 7 Scenes | -- | 28 |
| Monocular Depth Estimation | HAMMER | Depth REL: 0.028 | 26 |
| Depth Estimation | ETH3D | AbsRel: 0.033 | 25 |
| Video Depth Estimation | KITTI | Relative Error (Rel^d): 0.082 | 23 |
| Video Depth Estimation | ETH3D | Relative Error: 2.3 | 18 |
Showing 10 of 25 rows
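
For readers unfamiliar with the depth metrics in the table, the sketch below computes absolute relative error (Abs Rel) and the delta < 1.25 accuracy using their common definitions. The page does not say how each benchmark masks pixels, aligns scale, or whether values are reported as fractions or percentages, so those details are assumptions here, and the function names are illustrative.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over valid pixels (assumed: gt > 0)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    # A non-positive prediction can never satisfy the ratio test.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Flattened toy depth maps; the last pixel has no ground truth and is skipped.
pred = [1.0, 2.0, 3.0, 4.0]
gt = [1.0, 2.0, 4.0, 0.0]

rel = abs_rel(pred, gt)            # (0 + 0 + 0.25) / 3
acc = delta_accuracy(pred, gt)     # 2 of 3 valid pixels pass
```

Lower is better for Abs Rel and higher is better for delta accuracy. A figure such as 98.5 in the table would correspond to this fraction expressed as a percentage.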
