All-in-One: Transferring Vision Foundation Models into Stereo Matching

About

As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranking $1^{st}$ on the Middlebury dataset and outperforming all published work on the ETH3D benchmark.

Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, Yangyang Zhang • 2024

Related benchmarks

Task	Dataset	Result
Stereo Matching	KITTI 2015 (test)	D1 Error (Overall): 1.54	245
Stereo Matching	KITTI 2015	D1 Error (All): 1.43	142
Stereo Matching	KITTI 2012	--	108
Stereo Matching	KITTI 2012 (test)	Outlier Rate (3px, Noc): 1.05	105
Stereo Matching	ETH3D	bad 1.0: 0.94	95
Stereo Matching	Middlebury	--	84
Stereo Matching	Middlebury (test)	EPE: 0.85	60
Stereo Matching	Middlebury v3	--	35
Stereo Matching	Middlebury Half resolution (H)	Bad2.0 Error Rate: 6.48	30
Stereo Matching	Middlebury full resolution	2px Error Rate: 11.67	21
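
The metrics named in the table (EPE, bad-threshold rates, D1) can be illustrated with a minimal sketch. It assumes the definitions commonly used by these benchmarks, which this page does not itself state: EPE is the mean absolute disparity error, a bad-τ rate is the percentage of valid pixels whose error exceeds τ pixels, and D1 is the percentage of valid pixels whose error exceeds both 3 px and 5% of the ground-truth disparity. The function name `stereo_metrics` and its parameters are hypothetical, and disparities are passed as flat sequences for simplicity.

```python
from typing import Dict, Optional, Sequence


def stereo_metrics(
    pred: Sequence[float],
    gt: Sequence[float],
    valid: Optional[Sequence[bool]] = None,
    bad_thresh: float = 1.0,
) -> Dict[str, float]:
    """Compute EPE (px), bad-threshold rate (%), and D1 rate (%) over valid pixels.

    Assumed definitions (common benchmark conventions, not taken from this page):
      EPE   = mean |pred - gt|
      bad-t = share of pixels with |pred - gt| > t
      D1    = share of pixels with |pred - gt| > 3 px AND > 5% of gt
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    if valid is None:
        valid = [True] * len(gt)

    # Keep (absolute error, ground-truth disparity) pairs for valid pixels only.
    errs = [(abs(p - g), g) for p, g, v in zip(pred, gt, valid) if v]
    if not errs:
        raise ValueError("no valid pixels to evaluate")

    n = len(errs)
    epe = sum(e for e, _ in errs) / n
    bad = 100.0 * sum(e > bad_thresh for e, _ in errs) / n
    d1 = 100.0 * sum(e > 3.0 and e > 0.05 * g for e, g in errs) / n
    return {"epe": epe, "bad": bad, "d1": d1}


# Toy example with four pixels; absolute errors are 0, 0.5, 2.0, and 4.0 px.
example = stereo_metrics(
    pred=[10.0, 20.5, 30.0, 44.0],
    gt=[10.0, 20.0, 32.0, 40.0],
    bad_thresh=1.0,
)
```

On the toy input, two of the four pixels exceed the 1 px threshold and one exceeds the D1 criterion. Official benchmark numbers come from each benchmark's own evaluation server and masks (for example non-occluded versus all pixels), so this sketch only conveys what the columns measure.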
