# CARES: Context-Aware Resolution Selector for VLMs
## About
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens to 97-99% of total tokens, driving up compute and latency even when a low-resolution image would suffice. We introduce *CARES*, a **C**ontext-**A**ware **R**esolution **S**elector: a lightweight preprocessing module that, given an image-query pair, predicts the *minimal* sufficient input resolution. CARES uses a compact 350M-parameter VLM to extract features and predict the point at which a target pretrained VLM's ability to answer correctly saturates. Although trained as a discrete classifier over a set of candidate resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, and across diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
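The abstract does not spell out how a discrete classifier yields a continuous resolution at inference. A minimal sketch of one plausible reading, assuming the selector emits logits over a fixed set of candidate resolutions and the continuous output is the probability-weighted average of those candidates (the function name, candidate set, and interpolation rule below are illustrative assumptions, not the paper's exact method):

```python
import math

def select_resolution(logits, candidates):
    """Hypothetical CARES-style selection: softmax the classifier
    logits over discrete candidate resolutions, then interpolate
    a continuous resolution as the expected value."""
    # Numerically stable softmax over the candidate set
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Probability-weighted average -> continuous resolution
    return sum(p * c for p, c in zip(probs, candidates))

# Example: the selector strongly favors 448px for this image-query pair,
# so the interpolated resolution lands slightly above 448.
candidates = [224, 448, 672, 896]
res = select_resolution([0.1, 3.0, 0.5, -1.0], candidates)
```

The weighted average stays inside the candidate range, so downstream preprocessing never receives a resolution the selector was not trained near.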
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Chart Question Answering | ChartQA | -- | 356 |
| Document Visual Question Answering | DocVQA | -- | 263 |
| Diagram Question Answering | AI2D | -- | 232 |
| Massive Multi-discipline Multimodal Understanding | MMMU | -- | 152 |
| OCR Performance Evaluation | OCRBench | Score: 85 | 63 |
| Information Visual Question Answering | InfoVQA | Accuracy: 84 | 52 |
| Real-world Visual Understanding | RealworldQA | Accuracy: 79 | 47 |
| Mathematical Visual Question Answering | MathVista | Accuracy: 74 | 47 |
| Aggregate Multimodal Performance | Average | Score: 80 | 10 |
| Multimodal Understanding | SeedBench-2 | Score: 79 | 10 |