
Unlocking Dense Metric Depth Estimation in VLMs

About

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, which introduces error accumulation, or enable direct prediction through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs while running inference more efficiently, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Metric Depth Estimation | DepthVLM-Bench 1.0 (test) | Delta-1 Accuracy (Argoverse2) | 81 | 16
Metric Depth Estimation | ETH3D DepthVLM-Bench (evaluation) | Delta-1 Accuracy | 92.8 | 11
Metric Depth Estimation | DepthVLM-Bench Average of Waymo, NuScenes, ETH3D, sunRGBD, IBims-1 (evaluation) | Delta-1 Accuracy | 89 | 11
Metric Depth Estimation | NuScenes DepthVLM-Bench (evaluation) | Delta-1 Accuracy | 83.1 | 11
Metric Depth Estimation | IBims-1 DepthVLM-Bench (evaluation) | Delta-1 Accuracy | 93.6 | 11
Metric Depth Estimation | Waymo DepthVLM-Bench (val) | Delta-1 Accuracy | 87.9 | 11
Metric Depth Estimation | sunRGBD DepthVLM-Bench (evaluation) | Delta-1 Accuracy | 88.9 | 11
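
The results above are all reported as Delta-1 accuracy. As a rough illustration of what that number measures, the sketch below uses the conventional definition from the depth estimation literature: the share of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. It is an assumption that this benchmark uses exactly that threshold, that scores are percentages, and that pixels with non-positive ground truth are excluded; the function name `delta1_accuracy` and the sample values are hypothetical and not taken from the paper.

```python
def delta1_accuracy(pred, gt, threshold=1.25):
    """Return the percentage of valid pixels with max(pred/gt, gt/pred) < threshold.

    pred, gt: flat sequences of metric depths (for example, in metres).
    Pixels whose ground-truth depth is not positive are treated as missing
    and skipped (an assumption, mirroring common evaluation practice).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    valid = 0
    hits = 0
    for p, g in zip(pred, gt):
        if g <= 0:
            continue  # no ground truth for this pixel
        valid += 1
        # A non-positive prediction can never satisfy the ratio test.
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1

    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * hits / valid


# Hypothetical toy data: three valid pixels, one without ground truth.
predicted = [1.0, 2.0, 5.0, 10.0]
ground_truth = [1.1, 2.0, 8.0, 0.0]
score = delta1_accuracy(predicted, ground_truth)
print(f"Delta-1 accuracy: {score:.1f}%")
```

Because the test is a ratio, the metric is symmetric in over- and under-estimation, and it is sensitive to absolute scale: a depth map that is correct only up to an unknown scale factor will score poorly unless it is rescaled first.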
