Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

About

In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose using an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It achieves leading performance on several zero-shot benchmarks and even surpasses established private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
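
The abstract does not spell out how high-resolution refinement keeps the visual token count fixed. One common way to get that property is cross-attention in which the low-resolution tokens act as queries and the high-resolution features act as keys and values, so the output always has as many tokens as there are queries. The sketch below illustrates that idea only; it is an assumption, not necessarily the authors' design. The function name `refine_tokens`, the random projection matrices (stand-ins for learned weights), and all shapes are hypothetical.

```python
import numpy as np


def refine_tokens(low_res_tokens, high_res_feats, rng=None):
    """Refine N low-res tokens with M high-res features via cross-attention.

    low_res_tokens: array of shape (N, D), used as queries.
    high_res_feats: array of shape (M, D), used as keys and values.
    Returns an array of shape (N, D): the token count is unchanged.
    """
    n, d = low_res_tokens.shape
    rng = np.random.default_rng(0) if rng is None else rng

    # Hypothetical random projections standing in for learned weights.
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    q = low_res_tokens @ w_q   # (N, D)
    k = high_res_feats @ w_k   # (M, D)
    v = high_res_feats @ w_v   # (M, D)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.T / np.sqrt(d)                 # (N, M)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)

    # Residual connection: the output keeps the query shape (N, D).
    return low_res_tokens + weights @ v


data_rng = np.random.default_rng(0)
low = data_rng.standard_normal((16, 32))    # 16 low-res visual tokens
high = data_rng.standard_normal((256, 32))  # 256 high-res features
refined = refine_tokens(low, high)
print(refined.shape)  # (16, 32)
```

However many high-resolution features are supplied, the language model would still receive 16 tokens in this toy setup, which is the property the abstract describes.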

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 81.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 70.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 57.2 | 1043 |
| Visual Question Answering | GQA | Accuracy | 64.5 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 935 |
| Multimodal Evaluation | MME | Score | 1.92e+3 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 70.2 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 59.3 | 418 |
| Visual Question Answering | GQA | Accuracy | 60.7 | 374 |
| Multimodal Understanding | MMBench | Accuracy | 68.6 | 367 |

Showing 10 of 98 rows