Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design
About
Scaling laws have recently been employed to derive the compute-optimal model size (number of parameters) for a given compute budget. We advance and refine these methods to infer compute-optimal model shapes, such as width and depth, and successfully apply them to vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC-2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA, and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying its limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave the way toward more informed scaling.
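At the core of scaling-law analyses like this is fitting a power law relating compute to loss, then optimizing model dimensions against that fit. A minimal sketch of the fitting step, using entirely synthetic data and hypothetical coefficients (not the paper's actual fits): a power law loss ≈ a · compute^(−b) becomes a straight line in log-log space, so ordinary least squares recovers the exponent.

```python
import math

# Hedged sketch: fit loss = a * compute^(-b) by linear regression in
# log-log space. All data below is synthetic; the coefficients a=5.0
# and b=0.3 are illustrative, not values from the paper.
def fit_power_law(compute, loss):
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Ordinary least-squares slope and intercept in log-log space
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope  # log loss = log a + slope * log compute, so b = -slope

# Synthetic "training runs": loss generated exactly as 5 * C^(-0.3)
compute = [10 ** k for k in range(1, 7)]
loss = [5.0 * c ** (-0.3) for c in compute]
a, b = fit_power_law(compute, loss)
print(round(a, 3), round(b, 3))  # exact data, so the fit recovers a=5.0, b=0.3
```

In the paper's setting, separate fits of this kind over runs that vary one shape dimension at a time (width, depth, MLP size) let one pick the dimensions that minimize predicted loss at a target compute budget.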
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 56 | 963 |
| Image Classification | Stanford Cars | Accuracy | 93.6 | 477 |
| Image Classification | CIFAR-100 | Accuracy | 86.7 | 331 |
| Image Classification | Oxford-IIIT Pets | Accuracy | 97.6 | 259 |
| Image Classification | CUB-200-2011 | Accuracy | 88.8 | 257 |
| Image Classification | Caltech-101 | Accuracy | 91.3 | 198 |
| Image Classification | ImageNet | Accuracy | 84.1 | 184 |
| Image Classification | Describable Textures | Accuracy | 72.5 | 41 |
| Image Classification | UCM | Accuracy | 97.7 | 14 |
| Image Classification | ImageNet ReaL (val) | Accuracy | 0.91 | 7 |