Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

About

In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
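The idea of refining visual tokens with a second, high-resolution encoder without increasing the token count can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the refinement behaves like cross-attention: low-resolution tokens act as queries and high-resolution features act as keys and values. The names `refine_tokens`, `low_res_tokens`, and `high_res_features` are hypothetical, and so are the dimensions and the residual connection. The point is only that the output has exactly as many tokens as the low-resolution input.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine_tokens(low_res_tokens, high_res_features):
    """Refine each low-resolution token with high-resolution features.

    ASSUMPTION: plain dot-product cross-attention with a residual add;
    the paper's actual refinement module may differ.
    """
    dim = len(low_res_tokens[0])
    refined = []
    for query in low_res_tokens:
        # Scaled dot-product score against every high-resolution feature.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in high_res_features
        ]
        weights = softmax(scores)
        # Weighted sum of high-resolution features (used as values).
        context = [
            sum(w * feat[d] for w, feat in zip(weights, high_res_features))
            for d in range(dim)
        ]
        # Residual connection: one output token per input query token.
        refined.append([q + c for q, c in zip(query, context)])
    return refined


random.seed(0)
DIM = 8  # hypothetical feature width
low = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]    # 4 visual tokens
high = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]  # 16 high-res features
out = refine_tokens(low, high)
print(len(low), len(out))  # token count before and after refinement
```

Because the queries come only from the low-resolution branch, the number of tokens passed on to the language model stays fixed however many high-resolution features are attended to.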

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 88.1	2056
Visual Question Answering	VizWiz	Accuracy 57.2	1863
Visual Question Answering	TextVQA	Accuracy 70.2	1455
Visual Question Answering	GQA	Accuracy 64.5	1445
Visual Question Answering	VQA v2	Accuracy 81.5	1429
Text-based Visual Question Answering	TextVQA	Accuracy 70.2	984
Science Question Answering	ScienceQA	Accuracy 75.1	916
Multimodal Evaluation	MME	Score 1.92e+3	902
Multimodal Understanding	MMBench	Accuracy 68.6	887
Multimodal Understanding	MM-Vet	MM-Vet Score 59.3	664

Showing 10 of 112 rows
