
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

About

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
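The abstract's key architectural idea is a coordinate-based cross-attention upsampler: high-resolution pixel queries, built from image values plus coordinate embeddings, attend to the low-resolution VFM feature tokens. The following is a minimal NumPy sketch of that idea only, not the paper's implementation: the projection weights here are random for illustration (LoftUp uses trained transformer layers), and the Fourier coordinate embedding is one common choice of coordinate encoding, assumed here rather than taken from the paper.

```python
import numpy as np

def fourier_coords(h, w, num_freqs=4):
    # Normalized (x, y) pixel coordinates in [0, 1], lifted to sin/cos
    # Fourier features so nearby pixels get distinct, smooth embeddings.
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2)        # (h*w, 2)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi              # (F,)
    ang = coords[:, :, None] * freqs                           # (h*w, 2, F)
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1).reshape(h * w, -1)

def cross_attn_upsample(lr_feats, hi_res_img, out_hw, seed=0):
    """Single-head cross-attention upsampling sketch.

    Queries : high-res RGB pixels concatenated with coordinate embeddings.
    Keys/values : flattened low-resolution VFM feature tokens.
    """
    rng = np.random.default_rng(seed)
    H, W = out_hw
    C = lr_feats.shape[-1]
    # Build per-pixel queries from image values + coordinates.
    q_in = np.concatenate([hi_res_img.reshape(H * W, -1), fourier_coords(H, W)], axis=-1)
    Wq = rng.standard_normal((q_in.shape[-1], C)) / np.sqrt(q_in.shape[-1])
    q = q_in @ Wq                                              # (H*W, C)
    k = v = lr_feats.reshape(-1, C)                            # (h*w, C)
    # Scaled dot-product attention, softmax over the LR tokens.
    attn = q @ k.T / np.sqrt(C)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v).reshape(H, W, C)                         # (H, W, C) upsampled features
```

Because the output resolution is set only by the query grid, the same module can upsample to arbitrary sizes, which matches the paper's claim of adapting flexibly to various input and feature resolutions.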

Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 42.16 | 2731 |
| Semantic segmentation | PASCAL VOC (val) | mIoU | 84.63 | 338 |
| Semantic segmentation | COCO Stuff | mIoU | 62.15 | 195 |
| Semantic segmentation | Pascal VOC | mIoU | 0.8369 | 172 |
| Semantic segmentation | COCO Stuff (val) | mIoU | 62.19 | 126 |
| Monocular depth estimation | NYU V2 | Delta 1 Acc | 91.66 | 113 |
| Semantic segmentation | ADE20K | mIoU | 42.02 | 30 |
| Surface normal estimation | NYU V2 | RMSE | 33.94 | 23 |
| Semantic segmentation | Cityscapes (val) | mIoU | 62.09 | 9 |
| Depth estimation | COCO (val) | δ1 | 58.69 | 9 |
