
Depth Anything 3: Recovering the Visual Space from Any Views

About

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., a vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
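The abstract does not spell out the "depth-ray prediction target," but a common reading of such a representation is a per-pixel ray (origin and unit direction) plus a depth value along that ray, from which a 3D point map follows directly. A minimal NumPy sketch under that assumption (the function name and array layout are ours, not DA3's):

```python
import numpy as np

def points_from_depth_rays(origins, dirs, depth):
    """Lift per-pixel ray + depth predictions to a 3D point map.

    origins: (H, W, 3) ray origins, dirs: (H, W, 3) unit ray directions,
    depth: (H, W) depth along each ray. Returns (H, W, 3) points
    via P = o + t * d.
    """
    return origins + dirs * depth[..., None]

# Toy 2x2 "image": all rays start at the origin and point along +z,
# so every lifted point should land at z = depth.
H, W = 2, 2
origins = np.zeros((H, W, 3))
dirs = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
depth = np.full((H, W), 2.0)
points = points_from_depth_rays(origins, dirs, depth)
```

A single target of this form couples camera geometry (the rays) and scene geometry (the depths) in one output, which is consistent with the paper's claim that it removes the need for separate multi-task heads.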

Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang · 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Monocular Depth Estimation | ETH3D | AbsRel | 11 | 117 |
| Monocular Depth Estimation | DIODE | AbsRel | 24.2 | 93 |
| Monocular Depth Estimation | iBims-1 | AbsRel | 27.6 | 32 |
| Monocular Geometry Estimation | KITTI, ETH3D, iBims-1, DIODE (average) | AbsRel | 18.66 | 16 |
| Pointmap Estimation | Argoverse 2 (AV2) (test) | AbsRel | 0.174 | 15 |
| Pointmap Estimation | ONCE (test) | AbsRel | 0.403 | 15 |
| Pointmap Estimation | nuScenes (test) | AbsRel | 0.289 | 15 |
| Pointmap Estimation | NuPlan subsampled (test) | AbsRel | 0.265 | 15 |
| Pointmap Estimation | Waymo (test) | AbsRel | 0.383 | 15 |
| Monocular Geometry Estimation | KITTI | AbsRel | 11.8 | 11 |

(Showing 10 of 15 rows.)
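Every row above reports AbsRel, the standard absolute relative error for depth evaluation: the mean of |predicted − ground truth| / ground truth over valid pixels. A minimal sketch (the masking of non-positive ground-truth values is our assumption about how invalid pixels are handled):

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mask = gt > eps  # skip pixels with missing/zero ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Example: off by 100% on one pixel, exact on the other -> AbsRel = 0.5.
print(abs_rel([2.0, 4.0], [1.0, 4.0]))  # → 0.5
```

Benchmarks often report this as a percentage (multiply by 100), which is consistent with the mix of magnitudes in the table above.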

Other info

GitHub
