
PVT v2: Improved Baselines with Pyramid Vision Transformer

About

Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (1) a linear-complexity attention layer, (2) overlapping patch embedding, and (3) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and achieves significant improvements on fundamental vision tasks such as classification, detection, and segmentation. Notably, the proposed PVT v2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
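To illustrate the linear-complexity attention design mentioned above, the spatial-reduction idea can be sketched in plain NumPy: keys and values are average-pooled to a fixed-size grid before attention, so the cost grows linearly with the number of input tokens rather than quadratically. This is a minimal single-head sketch, not the authors' implementation; the function name, the fixed `pool` size, and the omission of learned projections and multi-head splitting are all simplifications.

```python
import numpy as np

def linear_sra_attention(x, h, w, pool=7):
    """Single-head attention where K/V are average-pooled to a fixed
    pool x pool grid, so the cost is O(N * pool^2) instead of O(N^2).

    x: (h*w, c) token features laid out on an h x w grid.
    Returns: (h*w, c) attended features.

    Illustrative sketch only: PVT v2 additionally uses learned Q/K/V
    projections and multiple heads, omitted here for brevity.
    """
    n, c = x.shape
    q = x  # queries keep full resolution

    # Average-pool the h x w token grid down to pool x pool for K and V
    # (assumes h and w are divisible by pool).
    hb, wb = h // pool, w // pool
    grid = x.reshape(h, w, c)
    kv = grid.reshape(pool, hb, pool, wb, c).mean(axis=(1, 3))
    kv = kv.reshape(pool * pool, c)

    # Scaled dot-product attention against the pooled keys/values.
    scores = q @ kv.T / np.sqrt(c)              # (n, pool*pool)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ kv                             # (n, c)

x = np.random.default_rng(0).normal(size=(14 * 14, 32))
out = linear_sra_attention(x, 14, 14, pool=7)
print(out.shape)  # (196, 32)
```

Each of the 196 queries attends to only 49 pooled key/value tokens, so doubling the input resolution doubles, rather than quadruples, the attention cost.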

Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao • 2021

Related benchmarks

Task                   Dataset                Metric                Result   Rank
Image Classification   CIFAR-10 (test)        -                     -        3381
Semantic Segmentation  ADE20K (val)           mIoU                  48.9     2888
Object Detection       COCO 2017 (val)        AP                    51.1     2643
Image Classification   ImageNet-1K 1.0 (val)  Top-1 Accuracy        83.8     1952
Image Classification   ImageNet-1K            Top-1 Acc             83.8     1239
Instance Segmentation  COCO 2017 (val)        APm                   0.432    1201
Classification         ImageNet-1K 1.0 (val)  Top-1 Accuracy (%)    83.8     1163
Semantic Segmentation  Cityscapes (test)      mIoU                  80.6     1154
Semantic Segmentation  ADE20K                 mIoU                  42.5     1024
Image Classification   ImageNet 1k (test)     Top-1 Accuracy        83.8     848

Showing 10 of 75 rows

Other info

Code
