Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation
About
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth, training only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Because captions provide coarse but noisy scale cues that vary with phrasing and omitted objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. Pooled multi-scale frozen visual features then select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning both the envelope and the selected calibration. In-domain experiments on NYUv2 and KITTI show improved accuracy, and zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
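The per-image affine calibration in inverse depth, and the closed-form least-squares oracle used as supervision, can be sketched as below. This is an illustrative NumPy reconstruction under stated assumptions (function names and the masking convention are ours, not the paper's implementation): the oracle fits scale and shift `(a, b)` so that `a * d_rel + b ≈ 1 / d_gt` over valid pixels, and the calibration maps relative inverse depth back to metric depth.

```python
import numpy as np

def affine_oracle_inverse_depth(rel_inv_depth, gt_depth, eps=1e-6):
    """Closed-form least-squares fit of per-image scale/shift (a, b)
    so that a * rel_inv_depth + b ~= 1 / gt_depth on valid pixels.

    rel_inv_depth : relative inverse-depth prediction (any shape)
    gt_depth      : metric ground-truth depth, same shape; <= eps means invalid
    """
    valid = gt_depth > eps
    x = rel_inv_depth[valid].ravel()
    y = 1.0 / gt_depth[valid].ravel()
    # Least squares for y ~= a*x + b; a 2-parameter problem with a closed form.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def apply_calibration(rel_inv_depth, a, b, eps=1e-6):
    """Map relative inverse depth to metric depth via the affine transform,
    clipping the calibrated inverse depth away from zero for stability."""
    return 1.0 / np.clip(a * rel_inv_depth + b, eps, None)
```

In training, `(a, b)` from the oracle would serve as the per-image target that the language-derived envelope must contain and that the vision-selected calibration is regressed toward.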
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.095 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.059 | 193 |
| Monocular Depth Estimation | SUN-RGBD (test) | Abs Rel | 0.147 | 22 |
| Monocular Depth Estimation | DDAD outdoor (test) | Delta < 1.25^3 Accuracy | 98.4% | 10 |