Co-Scale Conv-Attentional Image Transformers
About
In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
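The factorized attention module mentioned above replaces the quadratic token-to-token attention map with a context matrix computed once from keys and values, which is then applied to the queries. A minimal NumPy sketch of this factorization (shapes, scaling, and the toy inputs are illustrative assumptions, not the paper's exact implementation, and the conv-attentional relative position term is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def factorized_attention(Q, K, V):
    """Factorized attention sketch: softmax is applied to K along the
    token axis, then the (C x C) context K^T V is formed *before*
    multiplying by Q. Cost is linear in the number of tokens N,
    instead of the quadratic N x N attention map."""
    N, C = Q.shape
    context = softmax(K, axis=0).T @ V      # (C, C) key-value summary
    return (Q / np.sqrt(C)) @ context       # (N, C) output

# Toy usage with random tokens (N tokens, C channels).
rng = np.random.default_rng(0)
N, C = 8, 4
Q, K, V = (rng.standard_normal((N, C)) for _ in range(3))
out = factorized_attention(Q, K, V)
```

In the full CoaT module, this factorized term is combined with a depthwise-convolution-based relative position embedding (the "conv-attentional" part), which the sketch above leaves out for brevity.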
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy: 82.1 | 1866 |
| Image Classification | ImageNet (val) | Top-1 Accuracy: 80.8 | 1206 |
| Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%): 82.1 | 1155 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy: 81.9 | 840 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy: 80.8 | 798 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy: 82.1 | 706 |
| Image Classification | ImageNet-1K 1 (val) | Top-1 Accuracy: 81.9 | 119 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy: 81.9 | 91 |
| Image Classification | ImageNet-1K | Top-1 Accuracy: 81.9 | 78 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy: 82.3 | 55 |