# Understanding The Robustness in Vision Transformers

## About
Recent studies show that Vision Transformers (ViTs) exhibit strong robustness against various corruptions. Although this property is partly attributed to the self-attention mechanism, a systematic understanding is still lacking. In this paper, we examine the role of self-attention in learning robust representations. Our study is motivated by the intriguing properties of the emergent visual grouping in Vision Transformers, which indicates that self-attention may promote robustness through improved mid-level representations. We further propose a family of fully attentional networks (FANs) that strengthen this capability by incorporating an attentional channel processing design. We validate the design comprehensively on various hierarchical backbones. Our model achieves a state-of-the-art 87.1% top-1 accuracy on ImageNet-1K and 35.8% mCE (mean corruption error; lower is better) on ImageNet-C with 76.8M parameters. We also demonstrate state-of-the-art accuracy and robustness in two downstream tasks: semantic segmentation and object detection. Code is available at: https://github.com/NVlabs/FAN.
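The key idea of attentional channel processing, attending over the channel (feature) dimension rather than the token dimension, can be illustrated with a minimal sketch. This is a toy illustration under our own assumptions (function names, single head, and the exact normalization are ours), not the FAN implementation from the repository above.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x):
    """Toy channel self-attention for one image.

    Unlike standard (spatial) self-attention, which mixes tokens via an
    (n_tokens x n_tokens) attention map, this mixes channels via a
    (d_channels x d_channels) map, the idea behind FAN's attentional
    channel processing.

    x: (n_tokens, d_channels) feature matrix.
    Returns an array of the same shape.
    """
    n, d = x.shape
    # Channel-to-channel affinity, normalized over channels.
    attn = softmax(x.T @ x / np.sqrt(n), axis=-1)   # (d, d)
    # Re-aggregate features as attention-weighted channel mixtures.
    return x @ attn.T                               # (n, d)
```

For example, `channel_attention(np.random.randn(196, 384))` returns a `(196, 384)` array in which each output channel is a learned-free, softmax-weighted mixture of the input channels; in the actual FAN blocks the projections are learned and combined with MLP processing.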
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 54.1 | 2454 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 87.1 | 1866 |
| Semantic Segmentation | Cityscapes (val) | -- | -- | 572 |
| Image Classification | ImageNet-A | Top-1 Accuracy | 39.6 | 553 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 86.5 | 524 |
| Image Classification | ImageNet-Sketch | Top-1 Accuracy | 40.8 | 360 |
| Semantic Segmentation | Cityscapes (val) | mIoU | 82.3 | 287 |
| Image Classification | ImageNet-1k (val) | Accuracy | 84.3 | 189 |
| Image Classification | ImageNet-R | Accuracy | 52.7 | 148 |
| Image Classification | ImageNet-100 (test) | Clean Accuracy | 87.3 | 109 |