Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

About

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we extend recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical model, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), a 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and state-of-the-art results across the majority of image captioning tasks. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.
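The abstract names two mechanisms: extracting features from many image sub-crops, and mean-pooling the resulting tokens to keep the token count manageable. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The helper names (`split_into_crops`, `toy_encoder`, `mean_pool_tokens`), the crop and patch sizes, the random-projection stand-in for a ViT, and the choice to pool over contiguous groups of tokens are all assumptions made for illustration.

```python
import numpy as np


def split_into_crops(image: np.ndarray, crop: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping crop x crop tiles.

    Returns an array of shape (num_crops, crop, crop, C).
    """
    h, w, c = image.shape
    assert h % crop == 0 and w % crop == 0, "image must tile evenly"
    rows, cols = h // crop, w // crop
    tiles = image.reshape(rows, crop, cols, crop, c).swapaxes(1, 2)
    return tiles.reshape(rows * cols, crop, crop, c)


def toy_encoder(crops: np.ndarray, patch: int, dim: int, seed: int = 0) -> np.ndarray:
    """Hypothetical stand-in for a ViT: one token per patch.

    Each patch is flattened and mapped to a `dim`-vector with a fixed
    random projection. Returns shape (num_crops, tokens_per_crop, dim).
    """
    n, h, w, c = crops.shape
    assert h == w and h % patch == 0, "crops must be square and tile evenly"
    g = h // patch  # patches per side
    patches = crops.reshape(n, g, patch, g, patch, c).swapaxes(2, 3)
    patches = patches.reshape(n, g * g, patch * patch * c)
    proj = np.random.default_rng(seed).standard_normal((patch * patch * c, dim))
    return patches @ proj


def mean_pool_tokens(tokens: np.ndarray, group: int) -> np.ndarray:
    """Average every `group` consecutive tokens (assumed pooling scheme).

    Input (num_crops, T, dim) -> output (num_crops, T // group, dim).
    """
    n, t, d = tokens.shape
    assert t % group == 0, "token count must be divisible by group size"
    return tokens.reshape(n, t // group, group, d).mean(axis=2)


# Illustrative sizes only: a 448x448 image "zoomed in" as four 224x224 crops.
image = np.zeros((448, 448, 3), dtype=np.float32)
crops = split_into_crops(image, crop=224)          # (4, 224, 224, 3)
tokens = toy_encoder(crops, patch=14, dim=32)      # (4, 256, 32)
pooled = mean_pool_tokens(tokens, group=4)         # (4, 64, 32)
```

With these assumed sizes, pooling reduces the visual token count from 4 x 256 = 1024 to 4 x 64 = 256 while keeping one feature vector per pooled region of every sub-crop.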

Rahul Thapa, Kezhen Chen, Ian Covert, Rahul Chalamala, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 87.9	2019
Visual Question Answering	VizWiz	Accuracy 59	1820
Visual Question Answering	TextVQA	Accuracy 73.6	1453
Visual Question Answering	VQA v2	Accuracy 81	1429
Multimodal Evaluation	MME	Score 1.54e+3	727
Visual Question Answering	ChartQA	Accuracy 71.2	519
Visual Question Answering	ScienceQA	Accuracy 79.5	446
Diagram Question Answering	AI2D	AI2D Accuracy 67.9	387
Visual Question Answering	AI2D	Accuracy 67.9	317
Science Question Answering	ScienceQA (SQA)	Accuracy 79.7	273

Showing 10 of 25 rows
