
Empirical Recipes for Efficient and Compact Vision-Language Models

About

Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency while preserving accuracy. These techniques cut time to first token (TTFT) by 53% on InternVL3-2B and by 93% on SmolVLM-256M. Our recipes are broadly applicable across both VLM architectures and common serving frameworks, providing practical guidance for building efficient VLM systems. Beyond efficiency, we study how to extend compact VLMs with structured perception outputs and introduce the resulting model family, ArgusVLM. Across diverse benchmarks, ArgusVLM achieves strong performance while maintaining a compact and efficient design.
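The headline metric above, time to first token (TTFT), is the delay between submitting a request and receiving the first generated token; for VLMs it is typically dominated by prefill work (vision encoding plus prompt processing) rather than decoding. A minimal sketch of measuring it over a streaming decode loop, using a stand-in token generator in place of a real model (the function names and timings here are illustrative assumptions, not the paper's code):

```python
import time

def measure_ttft(stream):
    """Return (TTFT, total latency, tokens) for an iterable token stream."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in stream:
        if ttft is None:
            # First token observed: TTFT captures prefill + first decode step.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, tokens

def fake_vlm_stream(prefill_s=0.05, decode_s=0.005, n_tokens=5):
    """Stand-in for a real VLM: a slow prefill, then faster per-token decode."""
    time.sleep(prefill_s)  # prefill dominates TTFT
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(decode_s)

ttft, total, toks = measure_ttft(fake_vlm_stream())
print(f"TTFT={ttft * 1000:.1f} ms, total={total * 1000:.1f} ms, tokens={len(toks)}")
```

In a real serving setup the same harness wraps the framework's streaming API, and the 53%/93% reductions reported above would show up as a drop in the first of these two numbers.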

Jiabo Huang, Zhizhong Li, Sina Sajadmanesh, Weiming Zhuang, Lingjuan Lyu • 2026

Related benchmarks

Task                       Dataset    Metric       Result  Rank
Visual Question Answering  VQA v2     Accuracy     80.7    1362
Visual Question Answering  POPE       Accuracy     89.3    102
Visual Question Answering  GQA        Exact Match  63.3    13
Image Captioning           COCO 2017  BLEU-4       42.2    9
Image Captioning           NoCaps     BLEU-4       47.7    9
