
Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

About

Temporal consistency is the key challenge of video depth estimation. Previous works rely on additional optical flow or camera poses, which are time-consuming to obtain. By contrast, we derive consistency from less information. Since videos are inherently temporally redundant, a missing frame can be recovered from its neighbors. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer that predicts the depth of masked frames from their neighboring frames. By reconstructing masked temporal features, the FMNet learns intrinsic inter-frame correlations, which leads to consistency. Experimental results demonstrate that, compared with prior art, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation. Our official project page is https://github.com/RaymondWang987/FMNet.
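The core training idea — hiding some frames and predicting them from visible neighbors — can be sketched as a simple frame-masking step. The function and the `mask_ratio` parameter below are hypothetical illustrations, not the paper's actual implementation; FMNet's real masking scheme may differ.

```python
import random

def mask_frames(num_frames, mask_ratio=0.4, seed=None):
    """Select which frame indices of a clip to mask.

    A masked-frames-modeling network would then predict depth for the
    masked frames using only the visible (unmasked) neighbors, forcing
    it to learn inter-frame correlations.  `mask_ratio` is an assumed
    hyperparameter for illustration.
    """
    rng = random.Random(seed)
    num_masked = max(1, int(round(num_frames * mask_ratio)))
    masked = sorted(rng.sample(range(num_frames), num_masked))
    visible = [i for i in range(num_frames) if i not in masked]
    return masked, visible
```

For example, `mask_frames(8, mask_ratio=0.25, seed=0)` masks two of eight frames; the remaining six serve as the temporal context from which the masked depths are reconstructed.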

Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Depth Estimation | KITTI (Eigen split) | RMSE | 3.744 | 276 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.099 | 193 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 49.2 | 57 |
| Video Depth Estimation | KITTI (test) | Delta 1 | 88.6 | 25 |
| Video Depth Estimation | VDW (test) | Delta 1 | 47.2 | 24 |
| Video Depth Estimation | NYUDv2 (Eigen split) | OPW Score | 0.387 | 15 |
| Video Depth Estimation | NYUDv2 (test) | Delta 1 | 83.2 | 12 |
| Video Depth Estimation | KITTI (Eigen split) | Delta 1 Accuracy | 88.6 | 9 |
| Video Depth Estimation | Sintel MPI (full) | Delta Threshold Accuracy (< 1.25) | 35.7 | 8 |
