Vision Transformers Need More Than Registers
About
Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across supervision paradigms and downstream tasks, and their underlying mechanism has not been sufficiently elucidated. Through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: the ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label, text, and self-supervision. We hope this work offers a new perspective on ViT behavior.
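The selective-integration idea can be illustrated with a minimal sketch: score each patch against the current CLS token and fold only the top-scoring patches into the global representation, so low-relevance background patches cannot act as shortcuts. Note this is an assumption-laden illustration, not the paper's exact method; the function name `selective_cls_update`, the cosine-similarity criterion, and the top-k mean-pooling update are all hypothetical choices.

```python
import numpy as np

def selective_cls_update(cls_tok, patch_feats, k=4):
    """Hypothetical sketch of selective patch-to-CLS aggregation.

    cls_tok:     (B, D) current CLS token per image
    patch_feats: (B, N, D) patch features per image
    Scores patches by cosine similarity to CLS, keeps the top-k,
    and averages them into CLS; the paper's actual selection rule
    and update may differ.
    """
    norm = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
    # Cosine similarity of every patch to its image's CLS token: (B, N)
    sims = (norm(patch_feats) * norm(cls_tok)[:, None, :]).sum(-1)
    # Indices of the k most CLS-similar patches per image: (B, k)
    topk = np.argsort(-sims, axis=-1)[:, :k]
    # Gather the selected patch features: (B, k, D)
    selected = np.take_along_axis(patch_feats, topk[..., None], axis=1)
    # Fold only the selected patches into the global token
    return cls_tok + selected.mean(axis=1)

rng = np.random.default_rng(0)
cls_tok = rng.standard_normal((2, 16))
patches = rng.standard_normal((2, 10, 16))
new_cls = selective_cls_update(cls_tok, patches, k=3)
print(new_cls.shape)  # (2, 16)
```

Because low-similarity (background-dominated) patches never enter the pooled update, the CLS token cannot lean on them as a shortcut for global semantics.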
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | COCO Stuff | mIoU | 18.3 | 379 |
| Semantic segmentation | ADE20K | mIoU | 14.8 | 366 |
| Semantic segmentation | Cityscapes | mIoU | 24.5 | 218 |
| Semantic segmentation | COCO Object | mIoU | 26.2 | 129 |
| Instance segmentation | LVIS | mAP (Mask) | 34.3 | 81 |
| Semantic segmentation | Pascal Context 59 | mIoU | 24.7 | 79 |
| Semantic segmentation | VOC 2012 (val) | mIoU | 55.1 | 76 |
| Semantic segmentation | PASCAL VOC 2012 | mIoU | 79.6 | 42 |
| Unsupervised object discovery | PASCAL VOC 2012 | CorLoc | 67.6 | 42 |
| Unsupervised object discovery | PASCAL VOC 2007 | CorLoc | 64.4 | 33 |