# CARES: Context-Aware Resolution Selector for VLMs
## About
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens to 97-99% of total tokens, driving up compute and latency even when a low-resolution image would suffice. We introduce *CARES*, a **C**ontext-**A**ware **R**esolution **S**elector: a lightweight preprocessing module that, given an image-query pair, predicts the *minimal* sufficient input resolution. CARES uses a compact 350M-parameter VLM to extract features and predict the point at which a target pretrained VLM's ability to answer correctly saturates. Although trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, and across diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
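The abstract does not spell out how a discrete classifier yields a continuous resolution at inference. A minimal sketch of one plausible reading, assuming the selector emits logits over a fixed set of candidate resolutions and the continuous output is the probability-weighted average of those candidates (the function name, candidate set, and interpolation rule below are illustrative assumptions, not the paper's exact method):

```python
import math

def select_resolution(logits, candidates):
    """Hypothetical CARES-style selection: softmax the classifier
    logits over discrete candidate resolutions, then interpolate
    a continuous resolution as the expected value."""
    # Numerically stable softmax over the candidate set
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Probability-weighted average -> continuous resolution
    return sum(p * c for p, c in zip(probs, candidates))

# Example: the selector strongly favors 448px for this image-query pair,
# so the interpolated resolution lands slightly above 448.
candidates = [224, 448, 672, 896]
res = select_resolution([0.1, 3.0, 0.5, -1.0], candidates)
```

The weighted average stays inside the candidate range, so downstream preprocessing never receives a resolution the selector was not trained near.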
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Chart Question Answering | ChartQA | -- | 356 |
| Document Visual Question Answering | DocVQA | -- | 263 |
| Diagram Question Answering | AI2D | -- | 232 |
| Massive Multi-discipline Multimodal Understanding | MMMU | -- | 152 |
| OCR Performance Evaluation | OCRBench | Score: 85 | 63 |
| Information Visual Question Answering | InfoVQA | Accuracy: 84 | 52 |
| Real-world Visual Understanding | RealworldQA | Accuracy: 79 | 47 |
| Mathematical Visual Question Answering | MathVista | Accuracy: 74 | 47 |
| Aggregate Multimodal Performance | Average | Score: 80 | 10 |
| Multimodal Understanding | SeedBench-2 | Score: 79 | 10 |