Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
About
In this work, we introduce Mini-Gemini, a simple and effective framework that enhances multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflows from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose using an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It achieves leading performance on several zero-shot benchmarks and even surpasses established private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
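The abstract does not say how high-resolution refinement keeps the visual token count fixed. One plausible reading, offered here only as an illustrative assumption and not as the authors' implementation, is cross-attention in which the low-resolution tokens act as queries and the high-resolution features act as keys and values. The output then has one vector per query, so the number of tokens passed to the LLM does not grow. The function name `refine_tokens`, the residual connection, and the absence of learned projections are all simplifications chosen for this sketch.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(low_res_tokens, high_res_features):
    """Refine each low-resolution token by attending over high-resolution features.

    low_res_tokens:    N vectors of dimension d, used as queries
    high_res_features: M vectors of dimension d, used as keys and values
    Returns N refined vectors, so the token count is unchanged.
    (Hypothetical helper for illustration; no learned projections.)
    """
    dim = len(low_res_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in low_res_tokens:
        # Scaled dot-product score of this query against every high-res feature.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in high_res_features]
        weights = softmax(scores)
        # Attention-weighted average of the high-res features.
        context = [sum(w * value[i] for w, value in zip(weights, high_res_features))
                   for i in range(dim)]
        # Residual connection: original token plus the high-res detail.
        refined.append([q + c for q, c in zip(query, context)])
    return refined


random.seed(0)
low = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]    # 4 low-res tokens
high = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]  # 16 high-res features
out = refine_tokens(low, high)
print(len(low), len(out))  # 4 4
```

However many high-resolution features are supplied, the output length equals the number of low-resolution queries, which is the property the abstract describes.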
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 81.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 70.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 57.2 | 1043 |
| Visual Question Answering | GQA | Accuracy | 64.5 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 935 |
| Multimodal Evaluation | MME | Score | 1.92e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 70.2 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 59.3 | 418 |
| Visual Question Answering | GQA | Accuracy | 60.7 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 68.6 | 367 |