
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

About

Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute budget. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models more than twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, while also requiring less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path toward more informed scaling.
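The core idea of using scaling laws for model selection can be sketched as follows: measure validation error for each candidate shape at several small compute budgets, fit a power law per shape, and extrapolate to the target budget to pick the compute-optimal one. This is an illustrative sketch with made-up numbers and shape names, not the paper's actual fitting procedure or data.

```python
import numpy as np

def fit_power_law(compute, err):
    """Fit err = a * compute**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(err), 1)
    return np.exp(intercept), -slope

def predict(a, b, compute):
    return a * compute ** (-b)

# Toy measurements: validation error at three small training budgets (FLOPs)
# for two hypothetical model shapes. All values are illustrative.
compute = np.array([1e18, 4e18, 1.6e19])
errors = {
    "wide-shallow": np.array([0.40, 0.33, 0.28]),
    "narrow-deep":  np.array([0.42, 0.33, 0.26]),
}

# Extrapolate each fitted law to the target budget and pick the best shape.
target = 1e21
preds = {}
for shape, err in errors.items():
    a, b = fit_power_law(compute, err)
    preds[shape] = predict(a, b, target)

best = min(preds, key=preds.get)
print(f"compute-optimal shape at {target:.0e} FLOPs: {best}")
```

Because the steeper-sloped shape improves faster with compute, it wins at the larger budget even though both shapes perform similarly at the small budgets, which is the intuition behind shape optimization under a fixed compute budget.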

Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer • 2023

Related benchmarks

Task                       Dataset                  Result          Rank
Visual Question Answering  GQA                      Accuracy 56     963
Image Classification       Stanford Cars            Accuracy 93.6   477
Image Classification       CIFAR100                 Accuracy 86.7   331
Image Classification       Oxford-IIIT Pets         Accuracy 97.6   259
Image Classification       CUB-200 2011             Accuracy 88.8   257
Image Classification       Caltech-101              Accuracy 91.3   198
Image Classification       ImageNet                 Accuracy 84.1   184
Image Classification       Describable Textures     Accuracy 72.5   41
Image Classification       UCM                      Accuracy 97.7   14
Image Classification       ImageNet ReaL 6 (val)    Accuracy 0.91   7

(Showing 10 of 13 rows.)
