
Less is More: Consistent Video Depth Estimation with Masked Frames Modeling

About

Temporal consistency is the key challenge of video depth estimation. Previous works rely on additional optical flow or camera poses, which are time-consuming to obtain. By contrast, we derive consistency from less information. Since videos are inherently temporally redundant, a missing frame can be recovered from its neighbors. Inspired by this, we propose the frame masking network (FMNet), a spatial-temporal transformer that predicts the depth of masked frames from their neighboring frames. By reconstructing masked temporal features, the FMNet learns intrinsic inter-frame correlations, which leads to consistency. Experimental results demonstrate that, compared with prior art, our approach achieves comparable spatial accuracy and higher temporal consistency without any additional information. Our work provides a new perspective on consistent video depth estimation. Our official project page is https://github.com/RaymondWang987/FMNet.
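The core training idea — hiding some frames and predicting them from visible neighbors — can be sketched as a simple frame-masking step. The function and the `mask_ratio` parameter below are hypothetical illustrations, not the paper's actual implementation; FMNet's real masking scheme may differ.

```python
import random

def mask_frames(num_frames, mask_ratio=0.4, seed=None):
    """Select which frame indices of a clip to mask.

    A masked-frames-modeling network would then predict depth for the
    masked frames using only the visible (unmasked) neighbors, forcing
    it to learn inter-frame correlations.  `mask_ratio` is an assumed
    hyperparameter for illustration.
    """
    rng = random.Random(seed)
    num_masked = max(1, int(round(num_frames * mask_ratio)))
    masked = sorted(rng.sample(range(num_frames), num_masked))
    visible = [i for i in range(num_frames) if i not in masked]
    return masked, visible
```

For example, `mask_frames(8, mask_ratio=0.25, seed=0)` masks two of eight frames; the remaining six serve as the temporal context from which the masked depths are reconstructed.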

Yiran Wang, Zhiyu Pan, Xingyi Li, Zhiguo Cao, Ke Xian, Jianming Zhang • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Depth Estimation | KITTI (Eigen split) | RMSE | 3.744 | 276 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.099 | 193 |
| Video Depth Estimation | Sintel (test) | Delta 1 Accuracy | 49.2 | 57 |
| Video Depth Estimation | KITTI (test) | Delta 1 | 88.6 | 25 |
| Video Depth Estimation | VDW (test) | Delta 1 | 47.2 | 24 |
| Video Depth Estimation | NYUDv2 (Eigen split) | OPW Score | 0.387 | 15 |
| Video Depth Estimation | NYUDv2 (test) | Delta 1 | 83.2 | 12 |
| Video Depth Estimation | KITTI (Eigen split) | Delta 1 Accuracy | 88.6 | 9 |
| Video Depth Estimation | Sintel MPI (full) | Delta Threshold Accuracy (< 1.25) | 35.7 | 8 |
