Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models
About
Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
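The abstract's central idea, extracting features from many image sub-crops and then mean-pooling tokens to keep the sequence length manageable, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `toy_encoder` is a hypothetical stand-in for a ViT, and the grid size, patch size, and pooling window are assumed values chosen only for illustration. The abstract does not specify how the pooling window is laid out, so the sketch simply averages consecutive groups of tokens within each crop.

```python
import numpy as np

def split_into_crops(image, grid):
    """Split an (H, W, C) image into grid x grid equal, non-overlapping sub-crops."""
    h, w, _ = image.shape
    ch, cw = h // grid, w // grid
    return [image[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            for r in range(grid) for c in range(grid)]

def toy_encoder(crop, patch=4):
    """Hypothetical stand-in for a ViT: one token per patch, equal to the patch mean."""
    h, w, c = crop.shape
    blocks = crop.reshape(h // patch, patch, w // patch, patch, c)
    return blocks.mean(axis=(1, 3)).reshape(-1, c)  # (num_patches, c)

def mean_pool_tokens(tokens, window):
    """Average non-overlapping groups of `window` consecutive tokens."""
    n, d = tokens.shape
    if n % window != 0:
        raise ValueError("token count must be divisible by the pooling window")
    return tokens.reshape(n // window, window, d).mean(axis=1)

def encode_multi_crop(image, grid=2, patch=4, window=4):
    """Encode every sub-crop, pool its tokens, and concatenate the results."""
    pooled = [mean_pool_tokens(toy_encoder(crop, patch), window)
              for crop in split_into_crops(image, grid)]
    return np.concatenate(pooled, axis=0)

# Assumed toy input: a 32x32 RGB image with random values.
image = np.random.default_rng(0).random((32, 32, 3))
tokens = encode_multi_crop(image, grid=2, patch=4, window=4)
print(tokens.shape)
```

With these assumed settings, each of the 4 crops yields 16 tokens, and pooling with a window of 4 reduces the total from 64 tokens to 16. The point of the sketch is only that adding more crops grows the token count linearly, while mean pooling shrinks it by a constant factor.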
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 81 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 73.6 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 59 | 1043 |
| Object Hallucination Evaluation | POPE | Accuracy | 87.9 | 935 |
| Multimodal Evaluation | MME | Score | 1.54e+3 | 557 |
| Visual Question Answering | ChartQA | Accuracy | 71.2 | 239 |
| Visual Question Answering | ScienceQA | Accuracy | 79.5 | 210 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 67.9 | 196 |
| Multimodal Model Evaluation | MMBench | Accuracy | 71.9 | 180 |
| Visual Question Answering | VQAv2 | Accuracy | 81 | 177 |