ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer
About
The vanilla self-attention mechanism inherently relies on pre-defined and steadfast computational dimensions. Such inflexibility restricts it from possessing context-oriented generalization that can bring more contextual cues and global representations. To mitigate this issue, we propose a Scalable Self-Attention (SSA) mechanism that leverages two scaling factors to release the dimensions of the query, key, and value matrices while unbinding them from the input. This scalability brings context-oriented generalization and enhances object sensitivity, pushing the whole network toward a more effective trade-off between accuracy and cost. Furthermore, we propose an Interactive Window-based Self-Attention (IWSA), which establishes interaction between non-overlapping regions by re-merging independent value tokens and aggregating spatial information from adjacent windows. By stacking SSA and IWSA alternately, the Scalable Vision Transformer (ScalableViT) achieves state-of-the-art performance on general-purpose vision tasks. For example, ScalableViT-S outperforms Twins-SVT-S by 1.4% and Swin-T by 1.8% on ImageNet-1K classification.
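To make the SSA idea concrete, below is a minimal PyTorch sketch of a scalable self-attention layer. It is not the released implementation: the class name `ScalableSelfAttention`, the parameters `r_n` (spatial scaling) and `r_c` (channel scaling), and the strided-convolution reduction of key/value tokens are illustrative assumptions that only mirror the description above, where two scaling factors decouple the attention dimensions from the input.

```python
import torch
import torch.nn as nn


class ScalableSelfAttention(nn.Module):
    """Sketch of SSA: a spatial factor r_n shrinks the key/value token count,
    a channel factor r_c re-sizes the attention channel width, so the attention
    dimensions are no longer bound to the input resolution (assumed design)."""

    def __init__(self, dim, num_heads=8, r_n=0.25, r_c=1.0):
        super().__init__()
        self.num_heads = num_heads
        c_scaled = int(dim * r_c)               # scaled channel dimension
        self.head_dim = c_scaled // num_heads
        self.scale = self.head_dim ** -0.5

        # spatial reduction of keys/values by roughly the factor r_n (strided conv)
        stride = max(int(round((1 / r_n) ** 0.5)), 1)
        self.sr = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)

        self.q = nn.Linear(dim, c_scaled)
        self.k = nn.Linear(dim, c_scaled)
        self.v = nn.Linear(dim, c_scaled)
        self.proj = nn.Linear(c_scaled, dim)    # map back to the input width

    def forward(self, x, H, W):
        # x: (B, N, C) flattened feature map with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # shrink the spatial dimension of keys/values
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        feat = self.sr(feat).reshape(B, C, -1).transpose(1, 2)   # (B, N', C)
        k = self.k(feat).reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v(feat).reshape(B, -1, self.num_heads, self.head_dim).transpose(1, 2)

        # standard scaled dot-product attention on the scaled dimensions
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)


# Usage example (hypothetical shapes): tokens from a 56x56 stage-1 feature map
x = torch.randn(2, 56 * 56, 64)
ssa = ScalableSelfAttention(dim=64, num_heads=8, r_n=0.25, r_c=1.0)
print(ssa(x, H=56, W=56).shape)   # torch.Size([2, 3136, 64])
```

In a full ScalableViT block this layer would alternate with an IWSA layer, which applies window-partitioned attention and then re-merges the value tokens across adjacent windows; that part is omitted here for brevity.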
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 44.9 | 2731 |
| Image classification | ImageNet-1K 1.0 (val) | Top-1 accuracy | 84.1 | 1866 |
| Image classification | ImageNet-1K (val) | Top-1 accuracy | 84.1 | 840 |