Efficient Test-Time Scaling for Small Vision-Language Models
About
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
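The token-level aggregation behind TTAug can be illustrated with a minimal sketch. The paper does not publish this exact interface; the function and variable names below (`ttaug_decode`, `logits_fn`, `views`) are hypothetical, and the stub "model" simply perturbs a fixed logit table so that averaging across views cancels the per-view noise.

```python
import numpy as np

def ttaug_decode(logits_fn, views):
    """TTAug sketch: average per-token logits across augmented views,
    then greedy-decode. `logits_fn` maps one input view to an array of
    shape (seq_len, vocab_size)."""
    stacked = np.stack([logits_fn(v) for v in views])  # (n_views, T, V)
    mean_logits = stacked.mean(axis=0)                 # token-level aggregation
    return mean_logits.argmax(axis=-1).tolist()        # greedy token ids

# Toy demonstration: three "augmented views" (stand-ins for augmented
# images) each shift the logits in a different direction; the mean
# recovers the underlying prediction.
base = np.array([[0.1, 0.9, 0.0],
                 [0.8, 0.1, 0.1],
                 [0.2, 0.2, 0.6]])                     # T=3 tokens, V=3 vocab
views = [0.0, 0.3, -0.3]
logits_fn = lambda v: base + v * np.array([[1.0, -1.0, 0.0]] * 3)
print(ttaug_decode(logits_fn, views))                  # → [1, 0, 2]
```

In the same spirit, TTAdapt would reuse such an aggregated output as a consensus pseudolabel to compute a loss and update the model's parameters during inference; that training step is omitted here.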
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | TextVQA | Accuracy | 74.2 | 1117 |
| Text-based Visual Question Answering | TextVQA | -- | -- | 496 |
| Visual Question Answering | GQA | Accuracy | 13.5 | 374 |
| Visual Question Answering | ChartQA | Accuracy | 76.7 | 239 |
| Chart Question Answering | ChartQA | -- | -- | 229 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 68.8 | 196 |
| Visual Question Answering | AI2D | Accuracy | 69.7 | 174 |
| Diagram Understanding | AI2D | Accuracy | 69.7 | 167 |
| Optical Character Recognition Benchmarking | OCRBench | Accuracy | 73.7 | 109 |
| Hallucination Evaluation | AMBER | -- | -- | 71 |