Learning Temporally Consistent Video Depth from Video Diffusion Priors
About
This work addresses the challenge of streamed video depth estimation, which demands not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal to temporal consistency. We therefore reformulate depth prediction as a conditional generation problem that provides contextual information both within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context: during training, we sample independent noise levels for each frame within a clip; during inference, we use a sliding window and initialize overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.
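The cross-clip strategy described above can be illustrated with a minimal sketch. This is not the ChronoDepth implementation: the function names, the `predict_clip` callback, the uniform noise range, and the default clip length and overlap are all hypothetical choices made for illustration. The sketch shows only the two ideas named in the abstract: one independently drawn noise level per frame during training, and sliding-window inference in which overlapping frames are passed in as already-predicted, noise-free context.

```python
import random
from typing import Callable, List, Sequence


def sample_frame_noise_levels(clip_len: int, rng: random.Random) -> List[float]:
    """Training side: draw one noise level per frame, independently.

    Assumption: levels are uniform in [0, 1); the actual noise
    distribution is not stated in the abstract.
    """
    return [rng.random() for _ in range(clip_len)]


def sliding_window_inference(
    frames: Sequence,
    predict_clip: Callable[[Sequence, List], List],
    clip_len: int = 4,
    overlap: int = 1,
) -> List:
    """Inference side: process a long video clip by clip.

    `predict_clip(clip, context)` is a hypothetical model call. `context`
    holds depth maps already predicted for the first len(context) frames
    of `clip`; the model is expected to treat them as clean (no noise
    added) and to return one depth map per frame in `clip`.
    """
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")

    depths: List = []
    stride = clip_len - overlap
    start = 0
    while start < len(frames):
        clip = frames[start:start + clip_len]
        # Frames in this window that were predicted by the previous window.
        n_ctx = len(depths) - start
        context = depths[start:start + n_ctx]
        preds = predict_clip(clip, context)
        # Keep earlier predictions for overlapping frames; append the rest.
        depths.extend(preds[n_ctx:])
        if start + clip_len >= len(frames):
            break
        start += stride
    return depths


def toy_predict(clip: Sequence, context: List) -> List:
    """Stand-in model: reuses the context and halves each remaining frame value."""
    return list(context) + [float(f) * 0.5 for f in clip[len(context):]]


levels = sample_frame_noise_levels(4, random.Random(0))
depths = sliding_window_inference(list(range(7)), toy_predict, clip_len=4, overlap=1)
```

In this toy run, the second window starts at frame 3 and receives the depth already predicted for that frame as context, so each frame ends up with exactly one prediction and consecutive windows share information through the overlap.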
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Depth Estimation | NYU Depth V2 | -- | -- | 177 |
| Video Depth Estimation | Sintel | Relative Error (Rel) | 0.687 | 109 |
| Video Depth Estimation | BONN | Relative Error (Rel) | 0.1 | 103 |
| Depth Estimation | KITTI | AbsRel | 0.073 | 92 |
| Video Depth Estimation | KITTI | Abs Rel | 0.167 | 47 |
| Depth Prediction | Sintel | AbsRel | 0.687 | 32 |
| Video Depth Estimation | TUM dynamics | Abs Rel | 0.151 | 27 |
| 3D Point Tracking | TAPVid-3D PStudio (minival) | 3D-AJ | 11.8 | 19 |
| 3D Point Tracking | TAPVid-3D DriveTrack (minival) | 3D AJ Score | 8 | 19 |
| 3D Point Tracking | TAPVid-3D Aria (minival) | 3D-AJ | 12.3 | 19 |