
Understanding The Robustness in Vision Transformers

About

Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, there is still a lack of systematic understanding. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emergent visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% accuracy and 35.8% mCE on ImageNet-1k and ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
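The abstract's "attentional channel processing" refers to applying self-attention along the channel dimension of token features rather than along the token dimension. A minimal NumPy sketch of that idea is below; it is an illustration only, not the paper's implementation, and the function and weight names (`channel_attention`, `w_qkv`, `w_proj`) are made up for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, w_qkv, w_proj, num_heads=4):
    """Toy channel self-attention: the affinity matrix is (head_dim x
    head_dim), computed between feature channels, instead of the usual
    (n x n) matrix computed between the n tokens.

    x:      (n, d) token features
    w_qkv:  (d, 3d) combined query/key/value projection
    w_proj: (d, d) output projection
    """
    n, d = x.shape
    hd = d // num_heads
    qkv = x @ w_qkv                       # (n, 3d)
    q, k, v = np.split(qkv, 3, axis=-1)   # each (n, d)

    # Reshape to (heads, head_dim, n) so attention acts on the channel axis.
    def to_heads(t):
        return t.reshape(n, num_heads, hd).transpose(1, 2, 0)

    q, k, v = to_heads(q), to_heads(k), to_heads(v)
    attn = softmax((q @ k.transpose(0, 2, 1)) / np.sqrt(n))  # (heads, hd, hd)
    out = attn @ v                                           # (heads, hd, n)
    out = out.transpose(2, 0, 1).reshape(n, d)               # back to (n, d)
    return out @ w_proj
```

Because the attention map is channel-by-channel, its cost scales with feature width rather than sequence length, which is one reason such designs fit naturally into hierarchical backbones.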

Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Anima Anandkumar, Jiashi Feng, Jose M. Alvarez • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | AP | 54.1 | 2454 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 87.1 | 1866 |
| Semantic Segmentation | Cityscapes (val) | -- | -- | 572 |
| Image Classification | ImageNet-A | Top-1 Accuracy | 39.6 | 553 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 86.5 | 524 |
| Image Classification | ImageNet-Sketch | Top-1 Accuracy | 40.8 | 360 |
| Semantic Segmentation | Cityscapes (val) | mIoU | 82.3 | 287 |
| Image Classification | ImageNet-1k (val) | Accuracy | 84.3 | 189 |
| Image Classification | ImageNet-R | Accuracy | 52.7 | 148 |
| Image Classification | ImageNet-100 (test) | Clean Accuracy | 87.3 | 109 |

Showing 10 of 26 rows.
