Depth Anything 3: Recovering the Visual Space from Any Views
About
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. Our pursuit of minimal modeling yields two key insights: a single plain transformer (e.g., a vanilla DINO encoder) suffices as a backbone, with no architectural specialization, and a single depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. It also outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
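To make the depth-ray target concrete, here is a minimal NumPy sketch of how a per-pixel depth map and per-pixel ray map combine into a 3D point map. The function name `unproject` and the (H, W, 3) layout are illustrative assumptions, not DA3's actual API; the geometry (point = ray origin + depth × ray direction) is the standard relation such a target implies.

```python
import numpy as np

def unproject(ray_origins, ray_dirs, depth):
    """Combine a depth-ray prediction into a point map.

    ray_origins, ray_dirs: (H, W, 3) per-pixel ray origins and unit directions
    depth: (H, W) per-pixel depth along each ray
    Returns an (H, W, 3) point map in the rays' reference frame.
    """
    return ray_origins + ray_dirs * depth[..., None]

# Toy example (hypothetical values): a 2x2 image whose rays all start at the
# origin and point along +z, each with depth 5 -> every point is (0, 0, 5).
H, W = 2, 2
origins = np.zeros((H, W, 3))
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
depth = np.full((H, W), 5.0)
points = unproject(origins, dirs, depth)
print(points[0, 0])  # [0. 0. 5.]
```

Because both the rays and the depths live in one shared frame, predictions from multiple views land in a single consistent point cloud without a separate fusion step.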
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | ETH3D | AbsRel | 11 | 117 |
| Monocular Depth Estimation | DIODE | AbsRel | 24.2 | 93 |
| Monocular Depth Estimation | iBims-1 | AbsRel | 27.6 | 32 |
| Monocular Geometry Estimation | KITTI, ETH3D, iBims-1, DIODE Average | AbsRel | 18.66 | 16 |
| Pointmap Estimation | Argoverse 2 (AV2) (test) | AbsRel | 0.174 | 15 |
| Pointmap Estimation | ONCE (test) | AbsRel | 0.403 | 15 |
| Pointmap Estimation | nuScenes (test) | AbsRel | 0.289 | 15 |
| Pointmap Estimation | NuPlan subsampled (test) | AbsRel | 0.265 | 15 |
| Pointmap Estimation | Waymo (test) | AbsRel | 0.383 | 15 |
| Monocular Geometry Estimation | KITTI | AbsRel | 11.8 | 11 |
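Every row above reports AbsRel, the mean absolute relative depth error (lower is better; some leaderboards report it as a percentage, others as a ratio, which is why the table mixes values like 24.2 and 0.174). A minimal NumPy sketch, where the function name and the validity-mask threshold are illustrative choices:

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    mask = gt > eps  # skip pixels with no (or near-zero) ground-truth depth
    return np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])

# Toy example: relative errors of 10%, 10%, and 0% average to ~0.0667.
gt = np.array([2.0, 4.0, 8.0])
pred = np.array([2.2, 3.6, 8.0])
print(abs_rel(pred, gt))  # ~0.0667
```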