
Playing to Vision Foundation Model's Strengths in Stereo Matching

About

Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. Convolutional neural networks (CNNs) have long been the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, particularly for dense prediction tasks, their performance often falls short in geometric vision tasks. This study is the first to explore a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is built from three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels at a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
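The metric quoted in the abstract, the percentage of error pixels at a given tolerance, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `error_pixel_rate`, the nested-list disparity maps, and the convention that a non-positive ground-truth value marks a pixel without a measurement are all assumptions made for illustration.

```python
def error_pixel_rate(pred, gt, tolerance=3.0):
    """Return the percentage of valid pixels whose absolute
    disparity error exceeds `tolerance` (in pixels).

    `pred` and `gt` are 2D lists of disparities with the same shape.
    Assumption: gt <= 0 marks a pixel with no ground truth, which is skipped.
    """
    valid = 0
    bad = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g <= 0:
                continue  # no ground truth here (hypothetical convention)
            valid += 1
            if abs(p - g) > tolerance:
                bad += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * bad / valid


# Toy 2x3 disparity maps (made-up values, not benchmark data).
gt = [[10.0, 20.0, 0.0],
      [30.0, 40.0, 50.0]]
pred = [[10.5, 24.5, 7.0],
        [30.0, 36.0, 52.9]]

rate_3px = error_pixel_rate(pred, gt, tolerance=3.0)
rate_2px = error_pixel_rate(pred, gt, tolerance=2.0)
```

In the toy data, five pixels have ground truth; two of them are off by more than 3 pixels and three are off by more than 2 pixels, so a tighter tolerance yields a higher error rate.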

Chuang-Wei Liu, Qijun Chen, Rui Fan • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Stereo Matching | KITTI 2015 (all pixels) | D1 Error (Background) | 1.21 | 38
Stereo Matching | KITTI 2012 (Noc) | Error Rate (>2px) | 1.46 | 26
Stereo Matching | KITTI 2012 (All split) | Error Rate (>2px) | 1.8 | 26
Stereo Matching | KITTI 2015 (non-occluded) | D1 Error (Background) | 1.12 | 25
