
Co-Scale Conv-Attentional Image Transformers

About

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
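As a rough illustration of the factorized attention the abstract refers to, the sketch below shows the linear-complexity core (keys aggregate the values into a shared context, which scaled queries then read from), using illustrative names and shapes not taken from the paper's code. CoaT's full conv-attentional module additionally adds a convolutional relative position term, which is omitted here.

```python
import numpy as np

def factorized_attention(Q, K, V):
    """Minimal sketch of a factorized (linear-complexity) attention:
    softmax over tokens per key channel, aggregate values into a
    d x d context, then weight by scaled queries.

    Q, K, V: arrays of shape (N, d) for N tokens and d channels.
    Cost is O(N * d^2) rather than O(N^2 * d) for vanilla attention.
    """
    N, d = Q.shape
    # Softmax along the token axis for each key channel.
    K_soft = np.exp(K - K.max(axis=0, keepdims=True))
    K_soft = K_soft / K_soft.sum(axis=0, keepdims=True)
    # Keys aggregate the values first (d x d context matrix) ...
    context = K_soft.T @ V
    # ... then each scaled query reads from the shared context.
    return (Q / np.sqrt(d)) @ context

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
out = factorized_attention(Q, K, V)
print(out.shape)  # (16, 8)
```

Because the quadratic token-by-token attention map is never materialized, this form scales to the high-resolution feature maps used by the parallel co-scale branches.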

Weijian Xu, Yifan Xu, Tyler Chang, Zhuowen Tu • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 82.1 | 1866
Image Classification | ImageNet (val) | Top-1 Accuracy (%) | 80.8 | 1206
Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 82.1 | 1155
Image Classification | ImageNet-1k (val) | Top-1 Accuracy (%) | 81.9 | 840
Image Classification | ImageNet 1k (test) | Top-1 Accuracy (%) | 80.8 | 798
Image Classification | ImageNet-1k (val) | Top-1 Accuracy (%) | 82.1 | 706
Image Classification | ImageNet-1K 1 (val) | Top-1 Accuracy (%) | 81.9 | 119
Image Classification | ImageNet-1k (val) | Top-1 Accuracy (%) | 81.9 | 91
Image Classification | ImageNet-1K | Top-1 Accuracy (%) | 81.9 | 78
Image Classification | ImageNet 1k (test) | Top-1 Accuracy (%) | 82.3 | 55

(10 of 17 rows shown)
