Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

About

Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute budget. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models more than twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave the way for more informed scaling.
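Scaling-law analyses of this kind typically fit a power law to observed performance as a function of compute, then read off the optimum. As a minimal illustrative sketch (not the paper's actual estimator), the function and synthetic data below show how such a power-law exponent can be recovered by linear regression in log-log space; `fit_power_law` and all constants are hypothetical:

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**(-b) via linear regression in log-log space.

    Illustrative helper only; returns the estimated (a, b).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic "pretraining runs": loss follows 2.0 * C**-0.3 with mild noise.
rng = np.random.default_rng(0)
C = np.logspace(18, 22, 10)                              # compute budgets (FLOPs)
L = 2.0 * C**-0.3 * np.exp(rng.normal(0.0, 0.01, C.size))  # noisy observed losses

a, b = fit_power_law(C, L)  # b should be close to the true exponent 0.3
```

In practice, shape optimization as described in the abstract fits such laws along several shape dimensions (width, depth, MLP size) rather than a single size axis.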

Ibrahim Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, Lucas Beyer• 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | GQA | Accuracy | 56 | 1249
Image Classification | Stanford Cars | Accuracy | 93.6 | 635
Image Classification | CUB-200 2011 | Accuracy | 88.8 | 356
Image Classification | CIFAR100 | Accuracy | 86.7 | 347
Image Classification | Oxford-IIIT Pets | Accuracy | 97.6 | 306
Image Classification | Caltech-101 | Accuracy | 91.3 | 208
Image Classification | ImageNet | Accuracy | 84.1 | 184
Image Classification | Describable Textures | Accuracy | 72.5 | 41
Image Classification | UCM | Accuracy | 97.7 | 14
Image Classification | ImageNet ReaL 6 (val) | Accuracy | 0.91 | 7
Showing 10 of 13 rows
