
Efficient Test-Time Scaling for Small Vision-Language Models

About

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
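The token-level aggregation behind TTAug can be pictured as a majority vote across the generations produced for each augmented input. The sketch below is a minimal illustration under that assumption; the function and variable names are hypothetical and not the authors' implementation, and TTAdapt would additionally treat the consensus output as a pseudolabel for a parameter update.

```python
# Hypothetical sketch of TTAug-style token-level aggregation: several
# generations (one per augmented input) are combined position-by-position
# by majority vote, without any parameter updates.
from collections import Counter
from itertools import zip_longest

PAD = -1  # placeholder id for positions beyond a shorter sequence's end


def aggregate_token_level(generations):
    """Majority-vote each token position across candidate generations.

    generations: list of token-id lists, one per augmented input.
    Returns a single aggregated token-id list (the consensus output).
    """
    aggregated = []
    for position in zip_longest(*generations, fillvalue=PAD):
        votes = Counter(t for t in position if t != PAD)
        if not votes:  # all sequences exhausted at this position
            break
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated


# Three augmented views mostly agree; the vote recovers the consensus,
# which TTAdapt could then reuse as a pseudolabel during inference.
outputs = [[5, 9, 2], [5, 9, 7], [5, 3, 2]]
print(aggregate_token_level(outputs))  # [5, 9, 2]
```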

Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos • 2025

Related benchmarks

Task                                       | Dataset  | Result             | Rank
Visual Question Answering                  | TextVQA  | Accuracy 74.2      | 1117
Text-based Visual Question Answering       | TextVQA  | --                 | 496
Visual Question Answering                  | GQA      | Accuracy 13.5      | 374
Visual Question Answering                  | ChartQA  | Accuracy 76.7      | 239
Chart Question Answering                   | ChartQA  | --                 | 229
Diagram Question Answering                 | AI2D     | AI2D Accuracy 68.8 | 196
Visual Question Answering                  | AI2D     | Accuracy 69.7      | 174
Diagram Understanding                      | AI2D     | Accuracy 69.7      | 167
Optical Character Recognition Benchmarking | OCRBench | Accuracy 73.7      | 109
Hallucination Evaluation                   | AMBER    | --                 | 71

Showing 10 of 17 rows
