UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

About

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
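To make the dual objective in the abstract more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`high_norm_mask`, `dual_alignment_loss`), the median/MAD outlier rule, and the plain cosine-alignment loss are all hypothetical stand-ins. In particular, the detector below flags only high-norm outliers, which the abstract describes as too narrow a definition, and the cosine loss is a simpler substitute for whatever contrastive formulation the paper actually uses.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale vectors along the last axis to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def high_norm_mask(tokens, z_thresh=3.0):
    # Hypothetical detector: flag tokens whose L2 norm lies more than
    # z_thresh median-absolute-deviations above the median norm.
    norms = np.linalg.norm(tokens, axis=-1)
    med = np.median(norms)
    mad = np.median(np.abs(norms - med)) + 1e-8
    return (norms - med) / mad > z_thresh

def dual_alignment_loss(student_patches, student_registers,
                        teacher_patches, spurious_mask):
    # Term (i): pull student image tokens toward the teacher's regular
    # (non-spurious) tokens at the same positions. Assumption: spurious
    # positions are simply skipped in this term.
    regular = ~spurious_mask
    if regular.any():
        s = l2_normalize(student_patches[regular])
        t = l2_normalize(teacher_patches[regular])
        patch_term = float(np.mean(1.0 - np.sum(s * t, axis=-1)))
    else:
        patch_term = 0.0

    # Term (ii): pull register tokens toward the detected spurious tokens.
    # Assumption: the target is the mean of the teacher's spurious tokens.
    if spurious_mask.any():
        target = l2_normalize(teacher_patches[spurious_mask].mean(axis=0))
        r = l2_normalize(student_registers)
        register_term = float(np.mean(1.0 - r @ target))
    else:
        register_term = 0.0

    return patch_term + register_term
```

In this sketch, `teacher_patches` would come from the frozen pre-trained ViT and `student_patches` / `student_registers` from the model being fine-tuned with extra register tokens; a real training loop would compute the same quantities in an autodiff framework rather than NumPy.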

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | NYU V2 | Delta 1 Acc | 94.1 | 174 |
| Semantic segmentation | Pascal VOC | mIoU | 85.4 | 159 |
| Open Vocabulary Semantic Segmentation | Cityscapes | mIoU | 34.6 | 81 |
| Open Vocabulary Semantic Segmentation | ADE20K | mIoU | 21.7 | 80 |
| Open Vocabulary Semantic Segmentation | PASCAL VOC VOC21 with background 2012 | mIoU | 51.3 | 46 |
| Open-Vocabulary Segmentation | COCO Object | mIoU | 26.9 | 40 |
| Open Vocabulary Semantic Segmentation | Pascal Context 60 | mIoU | 29.4 | 38 |
| Open Vocabulary Semantic Segmentation | Pascal Context 59 | mIoU | 34.3 | 16 |
| Semantic segmentation | Cityscapes | mIoU | 74.6 | 10 |
| Semantic segmentation | ADE20K | mIoU | 51.9 | 10 |
Showing 10 of 12 rows
