Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

About

Recent developments in monocular depth estimation methods enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is training-expensive and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R to estimate temporal consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps of different timesteps. First, we fine-tune the DUSt3R model with additional estimated monocular depth as inputs for the dynamic scenes. Then, we apply optimization to reconstruct both depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video with superior performance than baseline methods.

Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, Yuan Liu• 2024

Related benchmarks

Task	Dataset	Result
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)64.1	235
Camera pose estimation	Sintel	ATE0.128	203
Video Depth Estimation	BONN	Relative Error (Rel)0.058	108
Camera pose estimation	TUM dynamics	ATE0.011	90
Video Depth Estimation	TUM dynamics	Abs Rel0.104	61
Pose Estimation	BONN	ATE0.023	38
Depth Prediction	Sintel	AbsRel0.253	32
Video Depth Estimation	PointOdyssey (val)	Abs Rel0.077	24
Camera pose estimation	Sintel 14-sequence	ATE0.128	24
Camera pose estimation	Bonn 60 (test)	ATE0.023	9

Showing 10 of 19 rows

Other info

Code

Follow for update

@wizwand_team Discord