
Learning Temporally Consistent Video Depth from Video Diffusion Priors

About

This work addresses the challenge of streamed video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Therefore, we reformulate depth prediction into a conditional generation problem to provide contextual information within a clip and across clips. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context. We sample independent noise levels for each frame within a clip during training while using a sliding window strategy and initializing overlapping frames with previously predicted frames without adding noise. Moreover, we design an effective training strategy to provide context within a clip. Extensive experimental results validate our design choices and demonstrate the superiority of our approach, dubbed ChronoDepth. Project page: https://xdimlab.github.io/ChronoDepth/.

Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Vitor Guizilini, Yue Wang, Matteo Poggi, Yiyi Liao • 2024
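The context-sharing scheme described in the abstract can be illustrated with a small sketch. It reads the abstract as: one independently sampled noise level per frame during training, and sliding-window inference in which frames already predicted by an earlier window are passed to the next window as clean, noise-free context. All names here (`sample_frame_noise_levels`, `clip_starts`, `predict_video`, `denoise_clip`) are hypothetical, the uniform [0, 1) noise range is an assumption, and the model is abstracted as a callable. This is not the ChronoDepth implementation.

```python
import random
from typing import Any, Callable, List, Optional, Sequence


def sample_frame_noise_levels(clip_len: int, rng: random.Random) -> List[float]:
    """Training side: draw one noise level per frame, independently.

    The uniform [0, 1) range is an assumption made for illustration.
    """
    return [rng.random() for _ in range(clip_len)]


def clip_starts(num_frames: int, clip_len: int, overlap: int) -> List[int]:
    """Start indices of sliding windows that share `overlap` frames."""
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    if num_frames <= clip_len:
        return [0]
    stride = clip_len - overlap
    starts = list(range(0, num_frames - clip_len + 1, stride))
    if starts[-1] + clip_len < num_frames:
        # Add a final window flush with the end of the video.
        starts.append(num_frames - clip_len)
    return starts


def predict_video(
    frames: Sequence[Any],
    denoise_clip: Callable[[Sequence[Any], List[Optional[Any]]], Sequence[Any]],
    clip_len: int,
    overlap: int,
) -> List[Optional[Any]]:
    """Inference side: run a clip model over an arbitrarily long video.

    `denoise_clip(clip_frames, context)` is a hypothetical stand-in for the
    model. `context[i]` holds the earlier prediction for an overlapping frame
    (to be used without added noise) or None for a frame still to be predicted.
    """
    depth: List[Optional[Any]] = [None] * len(frames)
    for start in clip_starts(len(frames), clip_len, overlap):
        end = min(start + clip_len, len(frames))
        context = depth[start:end]  # copy of what earlier windows produced
        preds = denoise_clip(frames[start:end], context)
        for i, pred in zip(range(start, end), preds):
            if depth[i] is None:  # keep earlier predictions on the overlap
                depth[i] = pred
    return depth
```

With 5 frames, `clip_len=3` and `overlap=1`, the windows start at frames 0 and 2, so frame 2 is predicted once and then reused as context for the second window.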

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Depth Estimation | NYU Depth V2 | -- | 177 |
| Video Depth Estimation | Sintel | Relative Error (Rel): 0.687 | 109 |
| Video Depth Estimation | BONN | Relative Error (Rel): 0.1 | 103 |
| Depth Estimation | KITTI | AbsRel: 0.073 | 92 |
| Video Depth Estimation | KITTI | Abs Rel: 0.167 | 47 |
| Depth Prediction | Sintel | AbsRel: 0.687 | 32 |
| Video Depth Estimation | TUM dynamics | Abs Rel: 0.151 | 27 |
| 3D Point Tracking | TAPVid-3D PStudio (minival) | 3D-AJ: 11.8 | 19 |
| 3D Point Tracking | TAPVid-3D DriveTrack (minival) | 3D AJ Score: 8 | 19 |
| 3D Point Tracking | TAPVid-3D Aria (minival) | 3D-AJ: 12.3 | 19 |
Showing 10 of 40 rows
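For readers unfamiliar with the AbsRel / Abs Rel entries in the table above, the following is a minimal sketch of the usual definition of absolute relative error: the mean of |prediction - ground truth| / ground truth over pixels with valid (positive) ground-truth depth. The function name `abs_rel` is hypothetical, and the exact protocol behind each benchmark row (for example scale and shift alignment or depth caps) is not stated on this page, so it is not reproduced here.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative error over pixels with positive ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Pixels with non-positive ground truth are treated as invalid and skipped.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    return sum(errors) / len(errors)
```

Lower is better: a value of 0.073 would mean predictions deviate from ground truth by about 7.3% on average under this definition.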
