Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

About

Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed because global scale is unidentifiable and sensitivity to domain shift is heightened. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth, training only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and with which objects are mentioned, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning both the envelope and the selected calibration. Experiments on NYUv2 and KITTI show improved in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
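The closed-form least-squares oracle mentioned in the abstract can be illustrated with a minimal sketch. It assumes the oracle fits a scale and shift so that `scale * rel_inv + shift` approximates `1 / gt_depth` over valid pixels; the function names, the `min_depth` validity threshold, and the `eps` clamp are hypothetical choices for illustration and are not taken from the paper. Plain Python lists are used to keep the example dependency-free.

```python
def fit_affine_inverse_depth(rel_inv, gt_depth, min_depth=1e-3):
    """Least-squares (scale, shift) with scale * rel_inv + shift ~= 1 / gt_depth.

    rel_inv  : flat list of relative inverse-depth predictions (assumed input)
    gt_depth : flat list of ground-truth metric depths, same length
    min_depth: hypothetical validity threshold; pixels at or below it are skipped
    """
    # Keep only pixels with valid ground truth and move the target to inverse depth.
    pairs = [(r, 1.0 / d) for r, d in zip(rel_inv, gt_depth) if d > min_depth]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid pixels")

    mean_r = sum(r for r, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0.0:
        raise ValueError("relative prediction is constant; scale is unidentifiable")

    # Closed-form solution of the 2-parameter linear regression.
    cov = sum((r - mean_r) * (y - mean_y) for r, y in pairs)
    scale = cov / var_r
    shift = mean_y - scale * mean_r
    return scale, shift


def to_metric_depth(rel_inv, scale, shift, eps=1e-6):
    """Apply the affine map in inverse depth, then invert to get metric depth."""
    # eps is a hypothetical clamp that avoids division by zero or negative values.
    return [1.0 / max(scale * r + shift, eps) for r in rel_inv]
```

In a setup like the one described, the fitted `(scale, shift)` pair would act as a per-image training target; at test time no ground truth is available, so the calibration would have to come from the learned heads instead.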

Mingxia Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.095 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.059 | 193 |
| Monocular Depth Estimation | SUN-RGBD (test) | Abs Rel | 0.147 | 22 |
| Monocular Depth Estimation | DDAD outdoor (test) | Delta < 1.25^3 Accuracy | 98.4 | 10 |
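For readers unfamiliar with the two metrics listed above, the following sketch computes them using their standard definitions: Abs Rel is the mean of |pred - gt| / gt over valid pixels, and the delta accuracy is the fraction of pixels where max(pred / gt, gt / pred) falls below 1.25 raised to a power. The function names and the "valid means strictly positive" rule are assumptions for illustration; the exact evaluation protocol behind the benchmark numbers (depth caps, crops, masks) is not stated here.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with positive ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, power=3):
    """Fraction of valid pixels with max(pred/gt, gt/pred) < 1.25 ** power."""
    thresh = 1.25 ** power
    # Both values must be positive for the ratio to be meaningful.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < thresh)
    return hits / len(pairs)
```

`delta_accuracy` returns a fraction in [0, 1]; the 98.4 figure for DDAD appears to be expressed as a percentage, so multiply by 100 when comparing against it.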
