Vision Transformers Need More Than Registers
About
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across supervision paradigms and downstream tasks, and their underlying mechanism has not been sufficiently elucidated. Through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label, text, and self-supervision. We hope this work offers a new perspective on ViT behavior.
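The selective-integration idea can be illustrated with a minimal sketch: score each patch against the current CLS token and fold only the top-scoring patches into the global representation, so low-relevance background patches cannot act as shortcuts. Note this is an assumption-laden illustration, not the paper's exact method; the function name `selective_cls_update`, the cosine-similarity criterion, and the top-k mean-pooling update are all hypothetical choices.

```python
import numpy as np

def selective_cls_update(cls_tok, patch_feats, k=4):
    """Hypothetical sketch of selective patch-to-CLS aggregation.

    cls_tok:     (B, D) current CLS token per image
    patch_feats: (B, N, D) patch features per image
    Scores patches by cosine similarity to CLS, keeps the top-k,
    and averages them into CLS; the paper's actual selection rule
    and update may differ.
    """
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    # Cosine similarity of every patch to its image's CLS token: (B, N)
    sims = (norm(patch_feats) * norm(cls_tok)[:, None, :]).sum(-1)
    # Indices of the k most CLS-similar patches per image: (B, k)
    topk = np.argsort(-sims, axis=-1)[:, :k]
    # Gather the selected patch features: (B, k, D)
    selected = np.take_along_axis(patch_feats, topk[..., None], axis=1)
    # Fold only the selected patches into the global token
    return cls_tok + selected.mean(axis=1)

rng = np.random.default_rng(0)
cls_tok = rng.standard_normal((2, 16))
patches = rng.standard_normal((2, 10, 16))
new_cls = selective_cls_update(cls_tok, patches, k=3)
print(new_cls.shape)  # (2, 16)
```

Because low-similarity (background-dominated) patches never enter the pooled update, the CLS token cannot lean on them as a shortcut for global semantics.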
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | COCO Stuff | mIoU | 18.3 | 379 |
| Semantic segmentation | ADE20K | mIoU | 14.8 | 366 |
| Semantic segmentation | Cityscapes | mIoU | 24.5 | 218 |
| Semantic segmentation | COCO Object | mIoU | 26.2 | 129 |
| Instance segmentation | LVIS | mAP (Mask) | 34.3 | 81 |
| Semantic segmentation | Pascal Context 59 | mIoU | 24.7 | 79 |
| Semantic segmentation | VOC 2012 (val) | mIoU | 55.1 | 76 |
| Semantic segmentation | PASCAL VOC 2012 | mIoU | 79.6 | 42 |
| Unsupervised object discovery | PASCAL VOC 2012 | CorLoc | 67.6 | 42 |
| Unsupervised object discovery | PASCAL VOC 2007 | CorLoc | 64.4 | 33 |