Vision Transformers Need More Than Registers

About

Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks, and their underlying mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.

Cheng Shi, Yizhou Yu, Sibei Yang • 2026
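The abstract does not give implementation details, so the following is only a rough illustration of what "selectively integrating patch features into the CLS token" could look like: a minimal PyTorch sketch that scores patch tokens, keeps the top-k most relevant ones, and pools them into the CLS token instead of letting global attention average over background patches. The class name, the linear relevance scorer, and the value of k are assumptions for illustration, not the authors' actual method.

import torch
import torch.nn as nn

class SelectiveCLSAggregation(nn.Module):
    """Hypothetical sketch: pool only the top-k most relevant patch tokens
    into the CLS token, rather than aggregating over all (possibly
    background-dominated) patches."""
    def __init__(self, dim: int, k: int = 32):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)   # assumed patch-relevance scorer
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_token: torch.Tensor, patch_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, D), patch_tokens: (B, N, D)
        scores = self.score(patch_tokens).squeeze(-1)            # (B, N)
        k = min(self.k, patch_tokens.size(1))
        topk = scores.topk(k, dim=1).indices                     # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        selected = patch_tokens.gather(1, idx)                   # (B, k, D)
        weights = torch.softmax(scores.gather(1, topk), dim=1)   # (B, k)
        pooled = (weights.unsqueeze(-1) * selected).sum(dim=1, keepdim=True)
        return cls_token + self.proj(pooled)                     # (B, 1, D)

# Usage with hypothetical ViT-Base dimensions:
# agg = SelectiveCLSAggregation(dim=768, k=32)
# cls = torch.randn(2, 1, 768)
# patches = torch.randn(2, 196, 768)
# out = agg(cls, patches)  # (2, 1, 768)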

Related benchmarks

Task                            Dataset             Metric       Result   Rank
Semantic segmentation           COCO Stuff          mIoU         18.3     379
Semantic segmentation           ADE20K              mIoU         14.8     366
Semantic segmentation           Cityscapes          mIoU         24.5     218
Semantic segmentation           COCO Object         mIoU         26.2     129
Instance segmentation           LVIS                mAP (Mask)   34.3     81
Semantic segmentation           Pascal Context 59   mIoU         24.7     79
Semantic segmentation           VOC 2012 (val)      mIoU         55.1     76
Semantic segmentation           PASCAL VOC 2012     mIoU         79.6     42
Unsupervised object discovery   PASCAL VOC 2012     CorLoc       67.6     42
Unsupervised object discovery   PASCAL VOC 2007     CorLoc       64.4     33

Showing 10 of 13 rows.
