MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
About
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: different resolutions offer complementary inductive biases, with low-resolution views excelling at global semantic recognition and high-resolution views being essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, while also demonstrating successful generalization to contrastive models like SigLIP2.
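The core idea, processing an image at several resolutions through a frozen backbone and fusing the resulting feature maps onto a common grid, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `dummy_vfm` is a hypothetical stand-in for a frozen VFM such as DINOv2, the scales and the averaging fusion are assumptions, and nearest-neighbour resizing replaces proper interpolation.

```python
import numpy as np

def dummy_vfm(image):
    # Hypothetical stand-in for a frozen VFM: average-pools 16x16
    # patches into per-patch feature vectors. A real model (e.g.
    # DINOv2) would return learned patch tokens instead.
    p = 16
    h, w, c = image.shape
    gh, gw = h // p, w // p
    return image[:gh * p, :gw * p].reshape(gh, p, gw, p, c).mean(axis=(1, 3))

def resize_nn(x, size):
    # Nearest-neighbour resize, used both for image views and for
    # aligning feature grids (a stand-in for bilinear interpolation).
    h, w = x.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[rows][:, cols]

def murf(image, scales=(0.5, 1.0, 2.0)):
    # Run the frozen backbone on each resolution, upsample every
    # feature grid to the full-resolution grid, and fuse by averaging.
    h, w = image.shape[:2]
    target = (h // 16, w // 16)
    fused = [
        resize_nn(dummy_vfm(resize_nn(image, (int(h * s), int(w * s)))), target)
        for s in scales
    ]
    return np.mean(fused, axis=0)
```

For a 64x64x3 input, each scaled view yields a patch-feature grid that is aligned to the 4x4 full-resolution grid before averaging; the fusion rule (here a plain mean) is the component one would tune per task.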
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU | 47.4 | 366 |
| Depth estimation | NYU Depth V2 | RMSE | 0.363 | 209 |
| Semantic segmentation | Pascal VOC | mIoU | 0.831 | 180 |
| Depth estimation | SUN RGB-D | Depth Error | 0.432 | 34 |
| Anomaly detection | MVTec AD mix 2 (test) | AU-PRO@0.05 | 62.3 | 4 |
| Anomaly detection | MVTec AD 2 (test) | AU-PRO@0.05 | 66 | 4 |