UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

About

Representation learning with Vision Transformers (ViTs) has advanced rapidly, yet the utility of large-scale models in spatially sensitive tasks is hindered by spurious tokens. Prior efforts to mitigate this have been limited, often defining these artifacts narrowly, for example, as simple high-norm outliers. We argue that this scope is insufficient. For dense prediction tasks, we posit that any token failing to encode location-aligned semantics should be treated as a spurious artifact. This broader definition reveals a more complex problem, leading us to systematically categorize and characterize three fundamental types of spurious tokens that corrupt spatial representations. Based on this comprehensive diagnosis, we propose UniRefiner, a universal refinement framework that teaches pre-trained ViTs to self-dispose of these artifacts. UniRefiner uses contrastive registers to explicitly isolate and redistribute spurious tokens via a dual objective: (i) it aligns image tokens with filtered regular tokens to preserve semantics, and (ii) it aligns register tokens with detected spurious tokens to capture the spurious signals. Our method requires only a few epochs of fine-tuning on ~5k images to refine diverse ViTs, including massive models like EVA-CLIP-8B and InternViT-6B. Experiments demonstrate consistent and significant improvements: notably, the refined EVA-CLIP-8B achieves 51.9% mIoU on ADE20K (+9.4%), surpassing specialized vision models like DINOv2 (49.1%), while zero-shot segmentation accuracy improves by up to 22%. UniRefiner unlocks the latent spatial potential of existing large-scale foundation models, paving the way for their broader application.
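The dual objective described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. Every name here (`detect_spurious`, `dual_alignment_loss`, the `ratio` threshold) is hypothetical. The detector uses the narrow high-norm criterion only as a stand-in for the paper's three-way categorization, and a plain cosine-alignment loss replaces whatever contrastive formulation the paper actually uses.

```python
import math
from statistics import median


def norm(v):
    """Euclidean norm of a token vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v) + 1e-12)


def detect_spurious(tokens, ratio=2.0):
    """Placeholder detector (assumption): flag tokens whose norm exceeds
    `ratio` times the median norm. The paper argues this criterion alone
    is too narrow; it is used here only to produce a mask."""
    norms = [norm(t) for t in tokens]
    threshold = ratio * median(norms)
    return [n > threshold for n in norms]


def dual_alignment_loss(patches, registers, teacher, spurious):
    """Return (regular_loss, spurious_loss) for one image.

    patches   : refined model's image tokens, one per location
    registers : refined model's register tokens (assumed non-empty)
    teacher   : frozen pre-trained model's tokens, one per location
    spurious  : boolean mask over locations from the detector
    """
    # (i) Image tokens follow the teacher only at locations kept as regular.
    regular_terms = [
        1.0 - cosine(p, t)
        for p, t, s in zip(patches, teacher, spurious)
        if not s
    ]
    # (ii) Each detected spurious token should be matched by some register.
    spurious_terms = [
        1.0 - max(cosine(r, t) for r in registers)
        for t, s in zip(teacher, spurious)
        if s
    ]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(regular_terms), mean(spurious_terms)


# Toy example: two regular teacher tokens and one high-norm outlier.
teacher = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
mask = detect_spurious(teacher)
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
registers = [[1.0, 1.0]]
regular_loss, spurious_loss = dual_alignment_loss(patches, registers, teacher, mask)
```

In this toy case the third teacher token is flagged as spurious, so the image token at that location is left unconstrained and the single register absorbs the outlier's direction. How the two terms are weighted and optimized during fine-tuning is not stated in the abstract and is left out here.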

Congpei Qiu, Zhaoyu Hu, Wei Ke, Zhuotao Tian, Yanhao Wu, Tong Zhang • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	Pascal VOC	mIoU 85.4	214
Monocular Depth Estimation	NYU V2	Delta 1 Acc 94.1	192
Open Vocabulary Semantic Segmentation	ADE20K	mIoU 21.7	110
Open Vocabulary Semantic Segmentation	Cityscapes	mIoU 34.6	105
Open Vocabulary Semantic Segmentation	Pascal Context 60	mIoU 29.4	65
Open Vocabulary Semantic Segmentation	PASCAL VOC VOC21 with background 2012	mIoU 51.3	46
Open-Vocabulary Segmentation	COCO Object	mIoU 26.9	40
Open Vocabulary Semantic Segmentation	Pascal Context 59	mIoU 34.3	20
Semantic segmentation	Cityscapes	mIoU 74.6	10
Semantic segmentation	ADE20K	mIoU 51.9	10

Showing 10 of 12 rows
