
JAFAR: Jack up Any Feature at Any Resolution

About

Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
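The attention mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes random projections in place of learned layers, block averaging as the image resizer, single-head attention, and the unprojected low-resolution features as values. The names `avg_pool_resize` and `upsample_features` are hypothetical.

```python
import numpy as np

def avg_pool_resize(img, out_h, out_w):
    """Resize (H, W, C) to (out_h, out_w, C) by averaging equal-sized blocks."""
    H, W, C = img.shape
    assert H % out_h == 0 and W % out_w == 0, "sizes must divide evenly"
    return img.reshape(out_h, H // out_h, out_w, W // out_w, C).mean(axis=(1, 3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def upsample_features(image, feats, out_h, out_w, d=16, seed=0):
    """Cross-attention upsampling sketch: image-derived queries, SFT-modulated keys."""
    rng = np.random.default_rng(seed)
    h, w, C = feats.shape
    # Assumption: random matrices stand in for learned projections.
    W_img = rng.normal(size=(image.shape[-1], d))
    W_gamma = rng.normal(size=(C, d)) / np.sqrt(C)
    W_beta = rng.normal(size=(C, d)) / np.sqrt(C)

    # High-resolution queries from low-level image content at the target size.
    q = avg_pool_resize(image, out_h, out_w).reshape(-1, image.shape[-1]) @ W_img
    # Low-resolution keys from the same image content at the feature-map size.
    k_img = avg_pool_resize(image, h, w).reshape(-1, image.shape[-1]) @ W_img

    # SFT modulation: per-location scale and shift predicted from the features.
    f = feats.reshape(-1, C)
    k = (f @ W_gamma) * k_img + (f @ W_beta)

    # Each output location is a softmax-weighted average of low-res features.
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (out_h*out_w, h*w)
    out = attn @ f
    return out.reshape(out_h, out_w, C), attn

# Usage: upsample a 4x4 feature map with 8 channels to 16x16.
rng = np.random.default_rng(1)
image = rng.random((32, 32, 3))
feats = rng.normal(size=(4, 4, 8))
out, attn = upsample_features(image, feats, 16, 16)
print(out.shape, attn.shape)
```

Because the number of queries is set only by the requested output size, the same weights can be applied at any target resolution. In this sketch that is the property behind the claim of generalizing from low training ratios to higher output scales.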

Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome • 2025

Related benchmarks

Task | Dataset | Result | Rank
Semantic segmentation | ADE20K (val) | mIoU 41.96 | 2731
Semantic segmentation | PASCAL VOC (val) | mIoU 84.38 | 338
Semantic segmentation | COCO Stuff | mIoU 61.82 | 195
Semantic segmentation | Pascal VOC | mIoU 0.8436 | 172
Semantic segmentation | COCO Stuff (val) | mIoU 61.71 | 126
Monocular Depth Estimation | NYU V2 | Delta 1 Acc 91.8 | 113
Semantic segmentation | Pascal VOC 21 classes (val) | mIoU 0.8444 | 103
Semantic segmentation | COCO Stuff-27 (val) | mIoU 60.78 | 75
Open-Vocabulary Segmentation | Cityscapes | mIoU 25.26 | 49
Semantic segmentation | VOC | mIoU 84.38 | 44
Showing 10 of 25 rows
