
Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

About

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships remains unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D-equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on a range of downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, finetuning on a single object for one iteration yields substantial gains. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
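The abstract measures 3D equivariance as the consistency of semantic embeddings at corresponding pixels across viewpoints. As an illustration only (not the paper's actual evaluation code), the sketch below scores this with mean cosine similarity between per-pixel features of two views, given ground-truth 2D matches; the function name, array shapes, and correspondence format are all assumptions.

```python
import numpy as np

def equivariance_score(feat_a, feat_b, corr):
    """Mean cosine similarity between features at corresponding pixels.

    feat_a, feat_b: (H, W, C) per-pixel feature maps from two views.
    corr: (N, 4) int array of matches, each row (row_a, col_a, row_b, col_b).
    Returns a scalar in [-1, 1]; higher means more view-consistent features.
    """
    fa = feat_a[corr[:, 0], corr[:, 1]]  # (N, C) features in view A
    fb = feat_b[corr[:, 2], corr[:, 3]]  # (N, C) features in view B
    fa = fa / np.linalg.norm(fa, axis=1, keepdims=True)
    fb = fb / np.linalg.norm(fb, axis=1, keepdims=True)
    return float((fa * fb).sum(axis=1).mean())

# Toy check: identical feature maps at identical locations score exactly 1.0.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16))
matches = np.stack([np.arange(8)] * 4, axis=1)  # diagonal self-matches
print(round(equivariance_score(feats, feats, matches), 6))  # → 1.0
```

In practice the feature maps would come from a ViT backbone (with patch features upsampled to pixel resolution) and the matches from known 3D correspondences projected into each view.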

Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas • 2024

Related benchmarks

Task                          Dataset           Metric     Result   Rank
Semantic Segmentation         ADE20K            mIoU       45.14    936
Monocular Depth Estimation    KITTI             Abs Rel    0.0772   161
Monocular Depth Estimation    NYU V2            --         --       113
Depth Estimation              ScanNet           Abs Rel    0.1269   94
Surface Normal Estimation     NYU V2            RMSE       32.6     23
Semantic Segmentation         ScanNet++         aAcc       82.23    8
Monocular Depth Estimation    ScanNet++ (val)   Rel        0.2849   8
