CARES: Context-Aware Resolution Selector for VLMs

About

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
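
To make the pipeline concrete, below is a minimal sketch of how such a selector could sit in front of a target VLM at inference time. The candidate resolution set, the softmax-expectation interpolation, and the `selector` / `target_vlm` interfaces are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a CARES-style inference flow.
# Names, interfaces, and the interpolation rule are assumptions for illustration.
import torch
import torch.nn.functional as F
from PIL import Image

# Assumed set of discrete candidate resolutions the selector was trained over.
CANDIDATE_RESOLUTIONS = [224, 448, 672, 896, 1344]


def select_resolution(selector, image, query):
    """Predict an input resolution for an image-query pair.

    `selector` stands in for the compact VLM with a classification head over
    the candidate resolutions; taking the softmax expectation is one plausible
    way to obtain a continuous resolution from the discrete classifier.
    """
    logits = selector(image, query)                    # shape: (num_candidates,)
    probs = F.softmax(logits, dim=-1)
    buckets = torch.tensor(CANDIDATE_RESOLUTIONS, dtype=torch.float)
    continuous = (probs * buckets).sum().item()        # expected resolution
    return int(round(continuous))


def answer_with_cares(selector, target_vlm, image: Image.Image, query: str) -> str:
    """Downscale the image to the predicted resolution before querying the target VLM."""
    side = select_resolution(selector, image, query)
    scale = side / max(image.size)
    resized = image.resize(
        (max(1, round(image.width * scale)), max(1, round(image.height * scale)))
    )
    return target_vlm.generate(resized, query)
```

The compute saving in this sketch comes entirely from the resize step: a smaller input image yields fewer visual tokens for the target VLM, while the selector itself adds only a small fixed overhead.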

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz · 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Chart Question Answering | ChartQA | - | 356 |
| Document Visual Question Answering | DocVQA | - | 263 |
| Diagram Question Answering | AI2D | - | 232 |
| Massive Multi-discipline Multimodal Understanding | MMMU | - | 152 |
| OCR Performance Evaluation | OCRBench | Score: 85 | 63 |
| Information Visual Question Answering | InfoVQA | Accuracy: 84 | 52 |
| Real-world Visual Understanding | RealworldQA | Accuracy: 79 | 47 |
| Mathematical Visual Question Answering | MathVista | Accuracy: 74 | 47 |
| Aggregate Multimodal Performance | Average | Score: 80 | 10 |
| Multimodal Understanding | SeedBench-2 | Score: 79 | 10 |
