Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation
About
We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. SoTA monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with different types of annotations, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
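The abstract does not spell out how the canonical camera space transformation works. One way to picture the idea, assuming a pinhole camera and assuming the ambiguity is resolved by rescaling depth labels with the focal-length ratio, is the sketch below. The function names and the canonical focal length of 1000 pixels are illustrative assumptions, not taken from the text above.

```python
# Illustrative sketch only: the names and the canonical focal length are assumptions.
CANONICAL_FOCAL = 1000.0  # hypothetical canonical focal length, in pixels


def to_canonical_depth(depth, focal, canonical_focal=CANONICAL_FOCAL):
    """Rescale metric depths as if they were seen by the canonical camera."""
    scale = canonical_focal / focal
    return [d * scale for d in depth]


def from_canonical_depth(canonical_depth, focal, canonical_focal=CANONICAL_FOCAL):
    """Undo the rescaling to recover metric depths for the real camera."""
    scale = focal / canonical_focal
    return [d * scale for d in canonical_depth]


# Depths (in meters) from a camera with a 500-pixel focal length.
canonical = to_canonical_depth([2.0, 4.0], focal=500.0)
recovered = from_canonical_depth(canonical, focal=500.0)
```

Under these assumptions a network trained on canonical depths sees one consistent camera, and the known focal length of each test image converts its prediction back to metric scale.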
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Surface Normal Estimation | NYU v2 (test) | Mean Angle Distance (MAD) | 12 | 206 |
| Monocular Depth Estimation | KITTI | AbsRel | 0.05 | 161 |
| Monocular Depth Estimation | ETH3D | AbsRel | 0.124 | 117 |
| Monocular Depth Estimation | NYU v2 | Delta-1 Accuracy | 97 | 113 |
| Depth Estimation | ScanNet | AbsRel | 0.023 | 94 |
| Monocular Depth Estimation | DIODE | AbsRel | 16 | 93 |
| Depth Estimation | KITTI | AbsRel | 0.052 | 92 |
| Depth Estimation | ScanNet (test) | -- | -- | 65 |
| Monocular Depth Estimation | ScanNet | AbsRel | 6.6 | 64 |
| Depth Estimation | DIODE | Delta-1 Accuracy | 89.2 | 62 |
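
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of their commonly used definitions. It assumes flat lists of per-pixel values, positive ground-truth depth marking a valid pixel, and the usual 1.25 threshold for Delta-1. The function names are illustrative, and individual benchmarks may differ in masking and scaling details.

```python
import math


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta1(pred, gt, threshold=1.25):
    """Fraction of valid pixels where max(pred/gt, gt/pred) is below the threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < threshold)
    return hits / len(pairs)


def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and ground-truth normal vectors."""
    angles = []
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
        angles.append(math.degrees(math.acos(cosine)))
    return sum(angles) / len(angles)
```

Leaderboards differ on whether AbsRel and Delta-1 are reported as fractions or percentages. The functions here return fractions, and the angular error is in degrees.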