SG-Former: Self-guided Transformer with Evolving Token Reallocation

About

Vision Transformers have demonstrated impressive success across various vision tasks. However, their heavy computation cost, which grows quadratically with the token sequence length, largely limits their power in handling large feature maps. To alleviate this cost, previous works rely either on fine-grained self-attention restricted to small local regions, or on global self-attention over a shortened sequence, which results in coarse granularity. In this paper, we propose a novel model, termed Self-guided Transformer (SG-Former), towards effective global self-attention with adaptive fine granularity. At the heart of our approach is a significance map, which is estimated through hybrid-scale self-attention and evolves during training, used to reallocate tokens based on the significance of each region. Intuitively, we assign more tokens to the salient regions to achieve fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields. The proposed SG-Former achieves performance superior to the state of the art: our base-size model achieves 84.7% Top-1 accuracy on ImageNet-1K, 51.2 box mAP on COCO, and 52.7 mIoU on ADE20K, surpassing the Swin Transformer by +1.3% / +2.7 mAP / +3 mIoU, with lower computation costs and fewer parameters. The code is available at https://github.com/OliverRensu/SG-Former
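
The reallocation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (the linked repository has that). The function names, the proportional largest-remainder rule, and the `min_tokens` floor are all assumptions made for illustration. The sketch only shows how a fixed key/value token budget might be split across regions according to a significance score, and how shrinking the key/value sequence reduces the quadratic attention cost.

```python
from typing import List


def attention_cost(num_queries: int, num_keys: int, dim: int) -> int:
    """Rough multiply-add count for one attention head.

    Computing QK^T takes num_queries * num_keys * dim multiply-adds,
    and weighting V by the attention matrix takes the same again.
    """
    return 2 * num_queries * num_keys * dim


def reallocate_tokens(
    significance: List[float], budget: int, min_tokens: int = 1
) -> List[int]:
    """Split `budget` key/value tokens across regions by significance.

    Hypothetical rule: every region keeps at least `min_tokens`, and the
    rest is shared in proportion to significance (largest-remainder rounding).
    """
    n = len(significance)
    if n == 0:
        raise ValueError("need at least one region")
    if budget < n * min_tokens:
        raise ValueError("budget too small to give every region min_tokens")
    if any(s < 0 for s in significance):
        raise ValueError("significance scores must be non-negative")

    spare = budget - n * min_tokens
    total = sum(significance)
    if total == 0:
        # No region stands out, so share the spare tokens evenly.
        shares = [spare / n] * n
    else:
        shares = [spare * s / total for s in significance]

    counts = [min_tokens + int(share) for share in shares]
    leftover = budget - sum(counts)
    # Give the tokens lost to rounding to the largest fractional parts.
    by_fraction = sorted(
        range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True
    )
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# Four regions, the first being the most salient, sharing 16 tokens.
allocation = reallocate_tokens([4.0, 2.0, 1.0, 1.0], budget=16)

# Full attention on a 56x56 map versus attention against 196 reduced tokens.
full_cost = attention_cost(56 * 56, 56 * 56, dim=64)
reduced_cost = attention_cost(56 * 56, 196, dim=64)
```

In this toy setting the most salient region receives the most tokens while every region keeps at least one, so attention stays global. Reducing the key/value sequence from 3136 to 196 tokens cuts the estimated attention cost by a factor of 16.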

Sucheng Ren, Xingyi Yang, Songhua Liu, Xinchao Wang • 2023

Related benchmarks

Task | Dataset | Result | Rank
Semantic segmentation | ADE20K (val) | -- | 3069
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy 84.7 | 2238
Semantic segmentation | ADE20K | mIoU 50.6 | 1028
Image Classification | ImageNet 1k (test) | Top-1 Accuracy 84.7 | 456
Object Detection | COCO 2017 | AP (Box) 48.2 | 345
Instance Segmentation | COCO 2017 | APm 43.6 | 236
Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy 0.841 | 191
Medical Image Classification | BUSI | Accuracy 78.19 | 126
Image Classification | Kvasir | Mean Accuracy 89.48 | 51
Medical Image Classification | ISIC 2018 | Accuracy 84.08 | 40
Showing 10 of 11 rows
