Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
About
Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
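The consistency idea mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names `unproject_depth` and `consistency_loss`, the pinhole camera model, the camera-to-world pose convention, and the L1 penalty are all assumptions chosen for illustration. The sketch lifts a predicted depth map to 3D using predicted intrinsics and pose, then measures how far the result is from a separately predicted point map.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole camera with intrinsics K (3x3) and a 4x4
    camera-to-world pose; both conventions are illustrative assumptions.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates, each (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # camera-space rays with z = 1
    pts_cam = rays * depth[..., None]      # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t               # rotate and translate into world space

def consistency_loss(depth, K, cam_to_world, point_map, valid=None):
    """Mean L1 distance between unprojected depth and a predicted point map.

    The L1 penalty and the optional validity mask are hypothetical choices.
    """
    pred = unproject_depth(depth, K, cam_to_world)
    err = np.abs(pred - point_map).sum(axis=-1)
    if valid is None:
        valid = np.ones(depth.shape, dtype=bool)
    return float(err[valid].mean())

# Toy usage: a point map derived from the same depth is perfectly consistent.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
pose = np.eye(4)
depth = np.full((3, 4), 2.0)
points = unproject_depth(depth, K, pose)
print(consistency_loss(depth, K, pose, points))        # zero for consistent inputs
print(consistency_loss(depth + 0.1, K, pose, points))  # positive once depth drifts
```

In a training setting the same quantity would be computed on differentiable tensors so that disagreement between the depth, camera, and point-map predictions is penalized jointly; the NumPy version here only shows the geometry.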
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Monocular Depth Estimation | KITTI | Abs Rel 0.106 | 220 |
| Video Depth Estimation | BONN | Relative Error (Rel) 0.041 | 108 |
| Monocular Depth Estimation | BONN | Delta 1.25 Accuracy 98.5 | 60 |
| Camera pose estimation | TUM | ATE 0.041 | 59 |
| Depth Estimation | HAMMER | -- | 29 |
| 2D Depth Estimation | 7 Scenes | -- | 28 |
| Monocular Depth Estimation | HAMMER | Depth REL 0.028 | 26 |
| Depth Estimation | ETH3D | AbsRel 0.033 | 25 |
| Video Depth Estimation | KITTI | Relative Error (Rel^d) 0.082 | 23 |
| Video Depth Estimation | ETH3D | Relative Error 2.3 | 18 |
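
For readers unfamiliar with the depth metrics named in the table, the following sketch shows the standard definitions of absolute relative error (Abs Rel) and threshold accuracy (Delta 1.25). The function names and the positive-depth validity mask are assumptions made here; each benchmark applies its own alignment, masking, and scaling (fraction versus percent) before reporting, and those details are not reproduced.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with positive ground-truth depth."""
    valid = gt > 0  # assumed validity rule: ignore pixels without ground truth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below threshold."""
    valid = (gt > 0) & (pred > 0)
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < threshold))

# Toy usage with three pixels: one slightly off, one exact, one badly off.
gt = np.array([1.0, 2.0, 4.0])
pred = np.array([1.1, 2.0, 8.0])
print(abs_rel(pred, gt))         # (0.1 + 0 + 1.0) / 3
print(delta_accuracy(pred, gt))  # two of three pixels fall within the 1.25 ratio
```

Lower is better for Abs Rel, higher is better for Delta 1.25 accuracy.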