
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

About

Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV architecture used in NLP, with the modifications necessary for vision tasks. Like the Vision Transformer (ViT), our model efficiently handles sparse inputs and exhibits robust global processing capabilities, while also scaling effectively to large parameter counts and extensive datasets. Its distinctive advantage is its reduced spatial aggregation complexity, which makes it exceptionally adept at processing high-resolution images without the need for windowing operations. Our evaluations demonstrate that VRWKV surpasses ViT in image classification while running significantly faster and using less memory on high-resolution inputs. In dense prediction tasks, it outperforms window-based models at comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks. Code is released at https://github.com/OpenGVLab/Vision-RWKV.
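The linear-cost global aggregation the abstract refers to can be illustrated with a simplified sketch. The fragment below (PyTorch; the function name, shapes, and the zero-decay simplification are ours, not taken from the released code) shows a bidirectional WKV-style mixing step over N patch tokens: every token is aggregated over all other tokens, yet the two global sums are computed once and reused, so the cost grows linearly in N rather than quadratically as in ViT's self-attention. The paper's actual Bi-WKV attention additionally applies a learnable exponential decay with token distance, which this sketch omits.

```python
import torch

def bi_wkv_zero_decay(k: torch.Tensor, v: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
    """Simplified bidirectional WKV mixing with the distance decay set to zero.

    k, v : (B, N, C) key / value projections of N patch tokens.
    u    : (C,) learnable bonus added to the current token's key.

    out_t = (sum_{i != t} exp(k_i) v_i + exp(u + k_t) v_t)
          / (sum_{i != t} exp(k_i)     + exp(u + k_t))

    The global sums over i are shared by all tokens, so the whole mixing
    step costs O(N * C) instead of the O(N^2 * C) of dense self-attention,
    while every output token still sees the entire (unwindowed) image.
    """
    k = k - k.max(dim=1, keepdim=True).values   # stabilize exp()
    ew = torch.exp(k)                           # (B, N, C) positive weights
    num = (ew * v).sum(dim=1, keepdim=True)     # global weighted value sum
    den = ew.sum(dim=1, keepdim=True)           # global weight sum
    self_w = torch.exp(u + k)                   # per-token bonus weight
    # Remove each token's own plain contribution, add its bonus term instead.
    return (num - ew * v + self_w * v) / (den - ew + self_w)

# Example: 196 patch tokens (a 14x14 grid) with 192 channels.
# k, v = torch.randn(2, 196, 192), torch.randn(2, 196, 192)
# out = bi_wkv_zero_decay(k, v, torch.zeros(192))  # -> (2, 196, 192)
```

Because no windowing is involved, the same operator applies unchanged at higher resolutions; the per-token work stays constant as N grows, which is consistent with the speed and memory behavior reported in the evaluations.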

Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang · 2024

Related benchmarks

Task                  | Dataset                | Result              | Rank
Semantic Segmentation | ADE20K (val)           | mIoU 49.2           | 2888
Object Detection      | COCO 2017 (val)        | --                  | 2643
Instance Segmentation | COCO 2017 (val)        | APm 41.7            | 1201
Semantic Segmentation | ADE20K                 | mIoU 49.2           | 1024
Image Classification  | ImageNet 1k (test)     | Top-1 Accuracy 82.0 | 450
Object Detection      | COCO 2017              | AP (Box) 46.8       | 321
Image Classification  | ImageNet-1k 1.0 (test) | Top-1 Accuracy 82.0 | 229
Instance Segmentation | COCO 2017              | APm 41.7            | 226
