
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

About

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
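The abstract says an additional visual encoder refines the visual tokens at high resolution without increasing their count, but it does not describe the mechanism. The sketch below shows one way such a refinement could work, assuming a cross-attention step in which each low-resolution token queries the high-resolution features. The function name `refine_tokens`, the residual addition, and the toy vectors are all illustrative assumptions and are not taken from the Mini-Gemini code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(low_res_tokens, high_res_features):
    """Hypothetical refinement step (an assumption, not the paper's code).

    Each low-resolution token is used as a query over all high-resolution
    feature vectors. The attention-weighted summary is added back to the
    query, so the output has exactly as many tokens as the low-res input.
    """
    dim = len(low_res_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for query in low_res_tokens:
        # Scaled dot-product score between this query and every high-res feature.
        scores = [
            scale * sum(q * k for q, k in zip(query, feature))
            for feature in high_res_features
        ]
        weights = softmax(scores)
        # Weighted sum of the high-res features, one value per dimension.
        summary = [
            sum(w * feature[d] for w, feature in zip(weights, high_res_features))
            for d in range(dim)
        ]
        # Residual connection keeps the original token content.
        refined.append([q + s for q, s in zip(query, summary)])
    return refined


# Toy data: 2 low-res tokens, 4 high-res feature vectors, dimension 2.
low_res = [[1.0, 0.0], [0.0, 1.0]]
high_res = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
refined = refine_tokens(low_res, high_res)
print(len(low_res), len(high_res), len(refined))
```

The point of the design is that the high-resolution features only appear as keys and values, so however many there are, the number of tokens passed on to the language model stays equal to the number of low-resolution tokens.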

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia • 2024

Related benchmarks

Task                                  Dataset    Result             Rank
Object Hallucination Evaluation       POPE       Accuracy 88.1      2019
Visual Question Answering             VizWiz     Accuracy 57.2      1820
Visual Question Answering             TextVQA    Accuracy 70.2      1453
Visual Question Answering             VQA v2     Accuracy 81.5      1429
Visual Question Answering             GQA        Accuracy 64.5      1425
Text-based Visual Question Answering  TextVQA    Accuracy 70.2      962
Multimodal Understanding              MMBench    Accuracy 68.6      847
Science Question Answering            ScienceQA  Accuracy 75.1      791
Multimodal Evaluation                 MME        Score 1.92e+3      727
Multimodal Understanding              MM-Vet     MM-Vet Score 59.3  631
Showing 10 of 112 rows
