Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

About

Multi-scale Vision Transformers (ViTs) have emerged as powerful backbones for computer vision tasks, but the self-attention computation in a Transformer scales quadratically with respect to the number of input patches. Existing solutions therefore commonly apply down-sampling operations (e.g., average pooling) over keys/values to dramatically reduce the computational cost. In this work, we argue that such over-aggressive down-sampling is not invertible and inevitably drops information, especially high-frequency components of objects (e.g., texture details). Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way. This design enables self-attention learning with lossless down-sampling over keys/values, facilitating the pursuit of a better efficiency-vs-accuracy trade-off. Furthermore, inverse wavelet transforms are leveraged to strengthen self-attention outputs by aggregating local contexts with an enlarged receptive field. We validate the superiority of Wave-ViT through extensive experiments over multiple vision tasks (e.g., image recognition, object detection and instance segmentation). Its performance surpasses state-of-the-art ViT backbones with comparable FLOPs. Source code is available at https://github.com/YehLi/ImageNetModel.
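To make the invertibility argument concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes a single-level 2D Haar wavelet on a single-channel array, and the function names (`haar_dwt2`, `haar_idwt2`, `avg_pool2`) are hypothetical. It shows that a wavelet decomposition halves the spatial resolution yet can be inverted exactly, whereas 2x2 average pooling maps visibly different inputs to the same output.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform of an (H, W) array with even H and W.

    Returns four (H/2, W/2) subbands. Subband naming (LH vs. HL) follows
    one common convention; other libraries may swap the two.
    """
    a = x[0::2, 0::2]  # top-left pixel of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal differences
    hl = (a + b - c - d) / 2.0  # vertical differences
    hh = (a - b - c + d) / 2.0  # diagonal differences
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2: rebuilds the (H, W) array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


def avg_pool2(x):
    """2x2 average pooling with stride 2 (the lossy baseline)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


# A fine checkerboard texture and a flat gray image of the same mean.
checker = (np.indices((4, 4)).sum(axis=0) % 2).astype(float)
flat = np.full((4, 4), 0.5)

# Average pooling cannot tell them apart ...
same_after_pooling = np.allclose(avg_pool2(checker), avg_pool2(flat))

# ... while the Haar subbands keep the texture and allow exact recovery.
recovered = haar_idwt2(*haar_dwt2(checker))
```

In this toy case the texture survives only in the high-frequency subbands, which average pooling discards. The sketch covers the down-sampling step alone; how the subbands feed into keys/values and how the inverse transform is used on attention outputs is specific to the paper and not reproduced here.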

Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 51.5 | 2731 |
| Object Detection | COCO 2017 (val) | AP | 52.1 | 2454 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 82.7 | 1866 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1144 |
| Semantic segmentation | ADE20K | -- | -- | 936 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 84.8 | 840 |
| Object Detection | COCO 2017 | AP (Box) | 47.2 | 279 |
| Instance Segmentation | COCO 2017 | APm | 43 | 199 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 82.7 | 78 |
| Image Recognition | ImageNet1K 1.0 (val) | Top-1 Acc | 85.5 | 47 |
