Unifying Flow, Stereo and Depth Estimation

About

We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
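The core idea in the abstract, dense correspondence found by directly comparing feature similarities after a cross-attention step, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `cross_attention` and `global_match_flow`, the absence of learned projections, the single attention head and the `temperature` parameter are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(feat_a, feat_b):
    """Toy single-head cross-attention (assumption: no learned projections).

    feat_a, feat_b: arrays of shape (H, W, C).
    Each location in feat_a gathers information from all locations in feat_b;
    the result is added to feat_a as a residual.
    """
    h, w, c = feat_a.shape
    q = feat_a.reshape(h * w, c)        # queries come from image A
    kv = feat_b.reshape(-1, c)          # keys and values come from image B
    attn = softmax(q @ kv.T / np.sqrt(c), axis=-1)
    return (q + attn @ kv).reshape(h, w, c)


def global_match_flow(feat1, feat2, temperature=1.0):
    """Estimate optical flow by softmax matching over all pixel pairs.

    feat1, feat2: arrays of shape (H, W, C) for the two images.
    Returns a flow field of shape (H, W, 2) holding (dx, dy) per pixel.
    """
    h, w, c = feat1.shape
    f1 = feat1.reshape(h * w, c)
    f2 = feat2.reshape(h * w, c)

    # Similarity between every pixel in image 1 and every pixel in image 2.
    corr = f1 @ f2.T / np.sqrt(c)                   # (H*W, H*W)
    prob = softmax(corr / temperature, axis=-1)     # matching distribution

    # Pixel coordinate grid in (x, y) order.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(h * w, 2).astype(np.float64)

    # Expected matching position under the distribution, minus the source position.
    matched = prob @ grid
    return (matched - grid).reshape(h, w, 2)
```

For example, if `feat2` is `feat1` shifted one pixel to the right and the features are distinctive enough, `global_match_flow(feat1, feat2)` returns a flow of approximately `(1, 0)` at interior pixels. Under the same assumptions, rectified stereo could reuse this idea by restricting the comparison to pixels on the same scanline.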

Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, Andreas Geiger • 2022

Related benchmarks

Task	Dataset	Metric	Result
Stereo Matching	KITTI 2015 (test)	D1 Error (Overall)	1.77	233
Optical Flow Estimation	Sintel Final (test)	--	--	133
Stereo Matching	KITTI 2015	D1 Error (All)	5.72	118
Optical Flow	KITTI 2015 (test)	Fl Error (All)	4.49	109
Stereo Matching	KITTI 2012	Error Rate (3px, All)	5.68	108
Depth Estimation	ScanNet (test)	Abs Rel	0.059	65
Optical Flow Estimation	KITTI 2015	Fl-all	4.49	60
Stereo Matching	Middlebury (test)	EPE	1.31	60
Optical Flow	Sintel Clean	EPE	1.03	59
Optical Flow	Sintel Final	EPE	2.37	59

Showing 10 of 51 rows
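The metrics named in the table follow common definitions, sketched below in NumPy. The function names are hypothetical, and the code makes simplifying assumptions: every pixel is treated as valid for flow, and the outlier rule uses the usual KITTI 2015 convention (error above 3 px and above 5% of the ground-truth magnitude). Official benchmark protocols may apply additional masks or different thresholds, so this is not a substitute for the benchmarks' own evaluation code.

```python
import numpy as np


def end_point_error(flow_pred, flow_gt):
    """EPE: mean Euclidean distance between predicted and ground-truth vectors.

    flow_pred, flow_gt: arrays of shape (H, W, 2).
    """
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())


def outlier_rate(pred, gt, abs_thresh=3.0, rel_thresh=0.05):
    """Percentage of outlier pixels (Fl-all / D1 style).

    A pixel counts as an outlier when its error exceeds both abs_thresh pixels
    and rel_thresh times the ground-truth magnitude.
    For disparity, pass arrays with a trailing axis of size 1.
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    mag = np.linalg.norm(gt, axis=-1)
    outliers = (err > abs_thresh) & (err > rel_thresh * mag)
    return float(100.0 * outliers.mean())


def abs_rel(depth_pred, depth_gt):
    """Abs Rel: mean of |pred - gt| / gt over pixels with valid (positive) depth."""
    valid = depth_gt > 0
    diff = np.abs(depth_pred[valid] - depth_gt[valid])
    return float((diff / depth_gt[valid]).mean())
```

Lower is better for all three metrics.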
