Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation
About
Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth, training only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Because captions provide coarse but noisy scale cues that vary with phrasing and omitted objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. Pooled multi-scale frozen visual features then select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning both the envelope and the selected calibration. In-domain experiments on NYUv2 and KITTI show improved accuracy, and zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
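The per-image affine calibration in inverse depth, and the closed-form least-squares oracle used as supervision, can be sketched as below. This is an illustrative NumPy reconstruction under stated assumptions (function names and the masking convention are ours, not the paper's implementation): the oracle fits scale and shift `(a, b)` so that `a * d_rel + b ≈ 1 / d_gt` over valid pixels, and the calibration maps relative inverse depth back to metric depth.

```python
import numpy as np

def affine_oracle_inverse_depth(rel_inv_depth, gt_depth, eps=1e-6):
    """Closed-form least-squares fit of per-image scale/shift (a, b)
    so that a * rel_inv_depth + b ~= 1 / gt_depth on valid pixels.

    rel_inv_depth : relative inverse-depth prediction (any shape)
    gt_depth      : metric ground-truth depth, same shape; <= eps means invalid
    """
    valid = gt_depth > eps
    x = rel_inv_depth[valid].ravel()
    y = 1.0 / gt_depth[valid].ravel()
    # Least squares for y ~= a*x + b; a 2-parameter problem with a closed form.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a, b

def apply_calibration(rel_inv_depth, a, b, eps=1e-6):
    """Map relative inverse depth to metric depth via the affine transform,
    clipping the calibrated inverse depth away from zero for stability."""
    return 1.0 / np.clip(a * rel_inv_depth + b, eps, None)
```

In training, `(a, b)` from the oracle would serve as the per-image target that the language-derived envelope must contain and that the vision-selected calibration is regressed toward.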
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.095 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.059 | 193 |
| Monocular Depth Estimation | SUN-RGBD (test) | Abs Rel | 0.147 | 22 |
| Monocular Depth Estimation | DDAD outdoor (test) | Delta < 1.25^3 Accuracy | 98.4% | 10 |