Learning Temporally Consistent Video Depth from Video Diffusion Priors
About
This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Therefore, we reformulate depth prediction into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel21.2 | 300 | |
| Depth Estimation | NYU Depth V2 | -- | 209 | |
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25)48.6 | 193 | |
| Video Depth Estimation | KITTI | Abs Rel0.167 | 126 | |
| Video Depth Estimation | BONN | AbsRel16.8 | 116 | |
| Depth Estimation | KITTI | AbsRel0.073 | 106 | |
| Video Depth Estimation | BONN | Relative Error (Rel)0.1 | 103 | |
| Video Depth Estimation | TUM dynamics | Abs Rel0.151 | 53 | |
| Depth Estimation | Sintel ~50 frames | AbsRel0.587 | 47 | |
| Depth Estimation | KITTI 110 frames | AbsRel16.7 | 46 |