Depth Completion as Parameter-Efficient Test-Time Adaptation
About
We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters via Parameter-Efficient Fine-Tuning (e.g., LoRA or VPT), guided by gradients computed directly from the sparse observations available at inference time. This grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
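The core recipe above — freeze the pre-trained weights, attach a low-rank residual, and descend on a loss defined only at the sparsely observed points — can be sketched in a few lines of numpy. This is a hypothetical toy illustration, not the authors' implementation: the "backbone" is a single frozen linear layer, the LoRA rank and shapes are made up, and the sparse depth targets are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16                       # feature dimension of the frozen "backbone" layer (illustrative)
r = 2                        # LoRA rank, r << d
W = rng.normal(size=(d, d))  # frozen pre-trained weight: never updated
A = np.zeros((d, r))         # LoRA down-projection, zero-initialized (standard LoRA init)
B = rng.normal(size=(r, d)) * 0.1  # LoRA up-projection

def forward(x):
    # Adapted layer: frozen weight plus low-rank residual A @ B.
    return x @ (W + A @ B)

# Stand-in for sparse geometric cues at inference: targets known at ~10% of pixels.
x = rng.normal(size=(64, d))                        # per-pixel features of one frame
target = x @ (W + rng.normal(size=(d, d)) * 0.05)   # synthetic "correct" output
mask = rng.random(64) < 0.1                         # sparse observation mask

def masked_mse(pred):
    return float(np.mean(((pred - target)[mask]) ** 2))

init_err = masked_mse(forward(x))

lr = 1e-2
for _ in range(200):
    err = (forward(x) - target) * mask[:, None]  # loss only on observed pixels
    # Gradients of 0.5 * ||err||^2 w.r.t. A and B; W stays frozen.
    gA = x.T @ err @ B.T
    gB = A.T @ x.T @ err
    A -= lr * gA / mask.sum()
    B -= lr * gB / mask.sum()

final_err = masked_mse(forward(x))
```

The point of the low-rank parameterization is visible in the shapes: the adaptation touches only `2*d*r` parameters instead of `d*d`, which is what keeps test-time optimization on a handful of sparse measurements from overfitting.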
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | ScanNet | AbsRel | 0.9 | 94 |
| 2D Depth Estimation | 7-Scenes | AbsRel | 0.9 | 20 |
| Depth Completion | ScanNet SIFT (test) | RMSE (%) | 0.053 | 16 |
| Depth Completion | ScanNet 100 pts | RMSE (%) | 0.053 | 16 |
| Depth Completion | ScanNet < 3m | RMSE | 8.9 | 16 |
| Depth Completion | 7-Scenes SfM | RMSE (%) | 11.1 | 16 |
| Depth Completion | 7-Scenes 100 pts | RMSE (%) | 6.1 | 16 |
| Depth Completion | 7-Scenes < 3m | RMSE (%) | 6.6 | 16 |
| Depth Completion | Metropolis 8-line | RMSE (%) | 1.31e+3 | 16 |
| Depth Completion | Metropolis 16-line | RMSE (%) | 1.20e+3 | 16 |