Rotary Position Embedding for Vision Transformer

About

Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE applied to ViTs, using practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., it maintains precision as image resolution increases at inference. This in turn leads to performance improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
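
To make "RoPE for 2D vision data" concrete, here is a minimal pure-Python sketch of one possible axial scheme: half of the channel pairs are rotated by the patch's x coordinate and the other half by its y coordinate. This is an illustration under stated assumptions, not the code from the linked repository. The function names (`axial_freqs`, `rope_2d`), the frequency schedule, and the default `base` value are all hypothetical choices made for this example.

```python
import cmath

def axial_freqs(head_dim, base=100.0):
    """Rotation frequencies for one axis (assumed schedule, not from the abstract)."""
    n = head_dim // 4  # number of complex channel pairs assigned to each axis
    return [base ** (-t / n) for t in range(n)]

def rope_2d(vec, x, y, base=100.0):
    """Rotate a real vector of length head_dim according to patch position (x, y)."""
    head_dim = len(vec)
    assert head_dim % 4 == 0, "head_dim must be divisible by 4"
    freqs = axial_freqs(head_dim, base)
    n = len(freqs)
    out = []
    for i in range(head_dim // 2):
        # Treat consecutive channels (2i, 2i+1) as one complex number.
        z = complex(vec[2 * i], vec[2 * i + 1])
        # First half of the pairs encodes x, second half encodes y.
        angle = x * freqs[i] if i < n else y * freqs[i - n]
        z *= cmath.exp(1j * angle)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Arbitrary example query and key vectors (head_dim = 8).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# The same relative offset (2, -1) at two different absolute positions.
score_a = dot(rope_2d(q, 3, 4), rope_2d(k, 1, 5))
score_b = dot(rope_2d(q, 10, 7), rope_2d(k, 8, 8))
```

In this sketch `score_a` and `score_b` agree up to floating-point error, because the rotated query-key dot product depends only on the relative offset between the two patches. That relative-position property is what makes it plausible for the same embedding to be reused on a larger patch grid at inference.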

Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc 83.8 | 1239 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy 80.4 | 543 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy 80 | 450 |
| Action Recognition | UCF101 | Accuracy 41.2 | 431 |
| Image Classification | CIFAR-100 | Accuracy 75.6 | 302 |
| Class-conditional Image Generation | ImageNet | -- | 158 |
| Object Detection | COCO | mAP 38 | 137 |
| 3D Object Classification | ModelNet40 | -- | 78 |
| Semantic segmentation | ScanNet | mIoU 71.1 | 59 |
| Image-to-Text Retrieval | DOCCI | -- | 38 |
Showing 10 of 34 rows
