UniNeXt: Exploring A Unified Architecture for Vision Recognition

About

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer it is equipped with. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which they were first proposed, our UniNeXt architecture steadily boosts the performance of all the spatial token mixers and narrows the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further shows the importance of studying the general architecture of vision backbones.
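
The separation the abstract draws between a fixed general architecture and an interchangeable spatial token mixer can be illustrated with a small sketch. This is not the UniNeXt design, whose details are not given here; it only shows the idea of a shared outer block wrapping a pluggable mixer. All names (`block`, `backbone`, `global_mean_mixer`, `make_window_mean_mixer`) are hypothetical, and simple averaging stands in for real convolution or attention modules.

```python
from typing import Callable, List

Tokens = List[List[float]]          # a sequence of tokens, each a feature vector
Mixer = Callable[[Tokens], Tokens]  # any function that mixes information across tokens


def global_mean_mixer(tokens: Tokens) -> Tokens:
    """Toy global mixer (assumption): every token receives the mean over all tokens."""
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / n for d in range(dim)]
    return [list(mean) for _ in tokens]


def make_window_mean_mixer(window: int) -> Mixer:
    """Toy local mixer (assumption): mean inside non-overlapping windows of tokens."""
    def mixer(tokens: Tokens) -> Tokens:
        out: Tokens = []
        for start in range(0, len(tokens), window):
            out.extend(global_mean_mixer(tokens[start:start + window]))
        return out
    return mixer


def block(tokens: Tokens, mixer: Mixer) -> Tokens:
    """Shared outer structure: a residual connection around whichever mixer is plugged in."""
    mixed = mixer(tokens)
    return [[x + m for x, m in zip(t, mt)] for t, mt in zip(tokens, mixed)]


def backbone(tokens: Tokens, mixer: Mixer, depth: int = 2) -> Tokens:
    """Stack `depth` identical blocks; only the mixer argument changes between variants."""
    for _ in range(depth):
        tokens = block(tokens, mixer)
    return tokens
```

Calling `backbone(tokens, global_mean_mixer)` and `backbone(tokens, make_window_mean_mixer(2))` runs the same outer structure with two different mixers, which is the kind of controlled comparison the abstract describes.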

Fangjian Lin, Jianlong Yuan, Sitong Wu, Fan Wang, Zhibin Wang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | -- | 1226 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 85.24 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 92.64 | 348 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.9433 | 346 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 83.4 | 315 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 89.37 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 88.73 | 300 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU 76.4 | 288 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU 72.5 | 272 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU 82.2 | 261 |
Showing 10 of 23 rows
