RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes
About
Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by using retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process the input and the retrieved context with a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing absolute relative error (AbsRel) by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.
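The sketch below illustrates the fusion step described above: features from the input image attend to features from a retrieved RGB-D neighbor, but attention is restricted to context positions whose correspondence confidence exceeds a threshold, so geometric context is transferred only at reliable matches. This is a minimal illustration, not the released implementation; the module name, tensor shapes, and the simple confidence-threshold mask are assumptions made for the example.

```python
# Hypothetical sketch of a matched cross-attention fusion step (module and
# parameter names are illustrative, not RAD's actual code).
import torch
import torch.nn as nn


class MatchedCrossAttention(nn.Module):
    """Cross-attention from input-image tokens (queries) to retrieved-context
    tokens (keys/values), masked by a per-token match-confidence score."""

    def __init__(self, dim: int = 256, num_heads: int = 8, conf_threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conf_threshold = conf_threshold
        self.norm = nn.LayerNorm(dim)

    def forward(self, x_feat, ctx_feat, ctx_conf):
        # x_feat:   (B, N, C) tokens from the input-image stream
        # ctx_feat: (B, M, C) tokens from the retrieved RGB-D context stream
        # ctx_conf: (B, M)    correspondence confidence per context token
        # Context tokens below the confidence threshold are masked out, so the
        # input stream only attends to reliable point correspondences.
        key_padding_mask = ctx_conf < self.conf_threshold  # True = ignored
        fused, _ = self.attn(
            query=x_feat, key=ctx_feat, value=ctx_feat,
            key_padding_mask=key_padding_mask,
        )
        # Residual connection keeps the original monocular features where no
        # reliable context is available.
        return self.norm(x_feat + fused)


if __name__ == "__main__":
    B, N, M, C = 2, 1024, 1024, 256
    fusion = MatchedCrossAttention(dim=C)
    x = torch.randn(B, N, C)      # input-image tokens
    ctx = torch.randn(B, M, C)    # retrieved-context tokens
    conf = torch.rand(B, M)       # per-token match confidence in [0, 1]
    out = fusion(x, ctx, conf)
    print(out.shape)              # torch.Size([2, 1024, 256])
```

In the paper's terms, the confidence mask plays the role of the "matched" constraint: attention weights are only formed over context tokens that correspond reliably to points in the input, rather than over the whole retrieved sample.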
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | NYU Depth V2 | RMSE | 0.192 | 177 |
| Depth Estimation | KITTI | AbsRel | 0.043 | 92 |
| Monocular Depth Estimation | NYU Depth v2 (Underrepresented classes) | δ < 1.25 | 97.5 | 10 |
| Monocular Depth Estimation | KITTI (Underrepresented classes) | δ < 1.25 | 96.6 | 10 |
| Monocular Depth Estimation | Cityscapes (Underrepresented classes) | δ < 1.25 | 93.5 | 8 |